integrated 
systems  inc. 


AD A139600


MODULAR  COMPUTING  NETWORKS: 

A  NEW  METHODOLOGY  FOR  ANALYSIS  AND 
DESIGN  OF  PARALLEL  ALGORITHMS/ 
ARCHITECTURES 


HANOCH  LEV-ARI 


PREPARED  FOR: 

OFFICE  OF  NAVAL  RESEARCH 
800  NORTH  QUINCY  STREET 
ARLINGTON, VIRGINIA 22217

ATTENTION: DR. DAVID W. MIZELL


PREPARED  UNDER: 

CONTRACT NO. N00014-83-C-0377


ISI REPORT 29


DECEMBER 1983


This document has been approved
for public release and sale; its
distribution is unlimited.




CONTENTS


Section

1  INTRODUCTION

2  MODELING PARALLEL ALGORITHMS AND ARCHITECTURES

   2.1  Toward a Formal Definition of Algorithms and Architectures
   2.2  Modular Computing Networks
   2.3  Causality and Executions
   2.4  Hierarchical Composition of MCNs
   2.5  Comparison of MCNs with Other Network Models
        2.5.1  Block-Diagrams and Finite-State Machines
        2.5.2  Data-Flow-Graphs and Petri-Nets
        2.5.3  High-Level Programming Languages
        2.5.4  Summary

3  STRUCTURAL PROPERTIES OF MCNs

   3.1  Numbering of Variables and Processors
   3.2  Dimensionality and Order
   3.3  Schedules, Delay and Throughput
   3.4  Space-Time Diagrams

4  ITERATIVE NETWORKS

   4.1  Properties of Iterative MCNs
   4.2  Hardware Architectures
   4.3  Completely Regular MCNs

5  CONCLUSIONS

REFERENCES

APPENDIX A:  PROOF OF THEOREM 2.2 FOR INFINITE MCNs

APPENDIX B:  ADMISSIBLE ARCHITECTURES

APPENDIX C:  PROOF OF THEOREM 2.3; MINIMAL EXECUTIONS OF FINITE MCNs

APPENDIX D:  ELEMENTARY EQUIVALENCE TRANSFORMATIONS




SECTION  1 
INTRODUCTION 




Several methods for modeling and analyzing parallel algorithms and
architectures have been proposed in recent years. These include
recursion-type methods, such as recursion equations, z-transform descriptions
and 'do-loops' in high-level programming languages, and
precedence-graph-type methods, such as data-flow graphs (marked graphs) and
related Petri-net-derived models [1], [2]. Most recent efforts have been
directed toward developing methodologies for structured parallel algorithms
and architectures and, in particular, for systolic-array-like systems
[3]-[10]. Some important properties of parallel algorithms have been
identified in the course of this research. These include executability (the
absence of deadlocks), pipelinability, regularity of structure, locality of
interconnections, and dimensionality. The research has also demonstrated
the feasibility of multirate systolic arrays, with different rates of data
propagation along different directions in the array.

The methodologies mentioned above provide some assistance in the
analysis and synthesis of parallel algorithms and architectures, but none of
them is flexible enough to address the wide scope of problems that arise in
this fast-developing discipline. In particular, none of these methodologies
is capable of clearly displaying the multiplicity of choices for
implementing a given set of recursion equations. Recent research has
vividly demonstrated this multiplicity by presenting several distinct
architectures for the same operation (e.g., matrix multiplication [4], [5],
[7], [10]).

In this paper we present a new methodology for modeling and analysis of
parallel algorithms and architectures. Our methodology provides a unified
conceptual framework that clearly displays the key properties of parallel
systems. In particular,


(1)  Executability of algorithms is easily verified.


(2)  Schedules  of  execution  are  easily  determined.  This  allows  for 
simple  evaluation  of  throughput  rates  and  execution  delays. 

(3)  Both synchronous and asynchronous (self-timed) modes of execution
can be handled with the same techniques.

(4)  Algorithms are directly mappable into architectures. No elaborate
hardware compilation is required.

(5)  The  description  of  a  parallel  algorithm  is  independent  of  its 
implementation.  All  possible  choices  of  hardware  implementation 
are  evident  from  the  description  of  a  given  algorithm.  The 
equivalence  of  existing  implementations  can  be  readily 
demonstrated. 

(6)  Both regular and irregular algorithms can be modeled. Models of
regular algorithms are significantly simpler to analyze, since
they inherit the regularity of the underlying problem.

Our methodology is largely based upon the theory of directed graphs and can,
therefore, be expressed both informally, in pictorial fashion, and formally,
in the language of precedence relations and composition of functions. This
duality will, hopefully, help to bridge the gap between the two schools of
research in this field.


SECTION  2 


MODELING  PARALLEL  ALGORITHMS  AND  ARCHITECTURES 


The concepts of 'algorithm' and 'architecture,' which have been widely
used for several decades, still seem to defy formal definition. Books on
computation and algorithms either take these concepts for granted or provide
a sketchy definition using such broad terms as 'precise prescription,'
'computing agent,' 'well-understood instructions,' 'finite effort' and so
forth. The purpose of this section is to provide a simple formal model for
the analysis of (parallel) algorithms and architectures. This model, which
we call the modular computing network (MCN), exhibits all the properties
usually attributed both to algorithms and to hardware architectures. As a
first step toward the formal introduction of this model, we extract in
Section 2.1 the main attributes of algorithms from their characterizations
in the literature. This analysis leads to the conclusion that algorithms
can only be defined in a hierarchical manner, i.e., as well-formed
compositions of simpler algorithms, and that the simplest (non-decomposable)
algorithms cannot and need not be defined. The building blocks of the
theory of algorithms are characterized in terms of three attributes:
function (what building blocks do), execution time (how long it takes them
to do it), and complexity (what it costs to use them). These observations
are incorporated into the modular computing network model, as described in
Sections 2.2 - 2.5.


2.1  TOWARD  A  FORMAL  DEFINITION  OF  ALGORITHMS  AND  ARCHITECTURES 

In  this  section  we  attempt  to  extract  the  main  attributes  of  algorithms 
and  architectures  from  a  randomly  chosen  sample  of  'definitions.'  Most 
characterizations  of  algorithms  are  geared  to  the  notion  of  sequential 
execution. Nevertheless, we shall see that this underlying assumption is
almost never made explicit. As a result, the attributes of parallel
algorithms  are,  in  fact,  included  in  the  available  characterizations. 

As a typical example, consider the following definition: 'The term
"algorithm" in mathematics is taken to mean a computational process, carried
out according to a precise prescription and leading from given objects,
which may be permitted to vary, to a sought-for result' [11]. This
definition simply states that an algorithm is a well-defined input-output
map and that its domain contains at least one element, and usually more than
one. However, the term 'computational process' hints that an algorithm is
more than just a well-defined function. Indeed, 'A function is simply a
relationship between the members of one set and those of another. An
algorithm, on the other hand, is a procedure for evaluating a function'
[12].

But  how  are  functions  evaluated?  We  are  told  that  'this  evaluation  is 
to  be  carried  out  by  some  sort  of  computing  agent,  which  may  be  human, 
mechanical,  electronic,  or  whatever'  [12].  Thus,  the  emphasis  is  on 
physical  realizability  (the  existence  of  a  'computing  agent')  but  not  on  the 
actual  details  of  the  realization.  The  first  axiom  of  the  theory  of 
algorithms  is,  therefore: 

There  exist  basic  functions  that  are  physically  realizable. 

Further  efforts  to  define  physical  realizability  turn  out  to  be  quite 
futile.  This  is  recognized  by  Aho,  Hopcroft  and  Ullman  who  say,  'each 
instruction  of  an  algorithm  must  have  a  'clear  meaning'  and  must  be 
executable  with  a  'finite  amount  of  effort.'  Now  what  is  clear  to  one 
person  may  not  be  clear  to  another,  and  it  is  often  difficult  to  prove 
rigorously  that  an  instruction  can  be  carried  out  in  a  finite  amount  of 
time' [13]. Physical realizability is a matter of technology: What is
non-realizable today may become realizable in a year or two. The theory of
algorithms has to assume the existence of realizable basic input-output maps
but need not be concerned with the details of their implementation.
Therefore, the core of any theory of algorithms is a non-empty collection of
undefined objects, which we shall call processors. These are the 'computing
agents' mentioned above, and they are assumed to have three attributes:

(i)  Function (an input-output map)

(ii)  Complexity measure

(iii)  Execution time

A  processor  is  assumed  to  be  capable  of  evaluating  the  input-output  map  in 
the  specified  execution  time.  The  cost  of  utilizing  the  processor  is 
specified  by  its  complexity  measure.  Notice  that  the  notion  of  'effort' 
mentioned  above  is  a  combination  of  the  processor's  complexity  and  its 
execution  time. 

It is important to draw a distinction between an algorithm and its
description. An algorithm consists of processors (or basic functions),
corresponding to all the functions that need to be evaluated. For instance,
the computation of sin x via the first 100 terms of its Maclaurin series
requires 100 basic functions, one for each term of the truncated series.

The description of the same algorithm in terms of instructions requires only
one instruction, which is repeated 100 times with varying coefficients.
Since descriptions of algorithms need to be communicated, stored and
implemented, they must be finite, i.e., contain a finite number of
instructions. The algorithm itself, on the other hand, may consist of an
infinite number of processors and may be used to process an infinite number
of inputs into an infinite number of outputs. Most signal processing
algorithms are of this kind: their inputs and outputs are time series
which may, in principle, be infinitely long. The executability of these
algorithms depends upon their capability to compute any specific output with
finite time and effort, and to use only a finite number of inputs for this
purpose. This observation also sheds new light on the concept of
'termination,' which is usually overemphasized in definitions of algorithms.
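The distinction between an algorithm and its description can be illustrated
with the sin x example. The following is a minimal sketch, not part of the
original report; the function names are hypothetical, and exact rational
arithmetic is an implementation choice made here only to keep the large
factorials from overflowing.

```python
import math
from fractions import Fraction

def maclaurin_term(x: float, k: int) -> float:
    """One basic function: the k-th term (-1)^k x^(2k+1) / (2k+1)! of sin x."""
    n = 2 * k + 1
    # Exact rational arithmetic avoids float overflow of (2k+1)! for large k.
    return float((-1) ** k * Fraction(x) ** n / math.factorial(n))

def sin_by_description(x: float, terms: int = 100) -> float:
    """Finite description: a single instruction repeated `terms` times."""
    total = 0.0
    for k in range(terms):
        total += maclaurin_term(x, k)
    return total

def sin_as_processors(terms: int = 100):
    """The algorithm itself: one processor (basic function) per series term."""
    return [lambda x, k=k: maclaurin_term(x, k) for k in range(terms)]
```

The description is a single loop body, while the unrolled list contains 100
distinct basic functions whose outputs are summed.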

The basic functions comprising an algorithm are interdependent in the
sense that the outputs of one processor may serve as inputs to other
processors. A complete characterization of an algorithm must, therefore,
specify both its basic operations and the interconnections between these
operations. The same statement applies, of course, to block-diagram
representations of hardware, to flow-graphs and, in fact, to any
network-type schematic. While algorithms are commonly described in some
formal language, they can also be described in a schematic manner.
Conversely, schematic hardware descriptions can be transformed into formal
language representations. To emphasize this equivalence we shall introduce
the concept of a modular computing network (MCN), which exhibits the common
attributes of both algorithms and architectures. Thus, an MCN is a pair

$M = \{\mathcal{F}, \mathcal{G}\}$    (2.1)

where $\mathcal{F}$, the function of the network, is essentially the
collection of basic functions discussed above, and $\mathcal{G}$, the
architecture of the network, is a directed graph describing the
interconnections between basic functions. A detailed definition is provided
in Section 2.2.

The concept of a modular computing network is hierarchical by nature.
Basic functions can themselves be characterized as networks of even more
basic functions. This requires every MCN to have the three fundamental
attributes of a basic function: input-output map, complexity and execution
time. We shall show in the sequel how to uniquely associate such attributes
with modular computing networks. The theory of MCNs is, in short, the
theory of network composition (deducing the properties of a network from its
components) and network decomposition (characterizing the components and
structure of a network whose composite properties have been specified).


2.2  MODULAR  COMPUTING  NETWORKS 

A  modular  computing  network  (MCN)  is  a  system  of  interconnected 
modules.  The  structural  information  about  the  network  is  conveyed  by 
specifying  the  interconnections  between  the  modules,  most  conveniently  in 
the form of a directed graph (Figure 2-1). The functional information about
the  network  is  conveyed  by  characterizing  the  information  transferred 
between  modules  and  the  processing  of  this  information  as  it  passes  through 
the  modules. 

The  structural  attributes  of  an  MCN  are  completely  specified  by  its 
architecture,  which  is  an  ordered  quadruple 

$\mathrm{Architecture} = (S, T, A, P)$    (2.2)

where S, T are sets whose elements are called sources and sinks,
respectively, and A, P are relations between these sets.

The ancestry relation A specifies the connections of sources to
sinks. The elements of A, which are called arcs, are ordered source-sink
pairs

$a \in A \iff a = (s, t),\quad s \in S,\quad t \in T$    (2.3)

An  arc  represents  a  direct  transfer  of  information  from  source  to  sink.  Two 
basic  assumptions  govern  this  transfer: 

(1)  There are no dangling sources. Every source is connected to
exactly one sink.

(2)  There  are  no  dangling  sinks.  Every  sink  is  connected  to  exactly 
one  source. 

These  assumptions  mean  that  the  three  sets  S,T,A  have  an  equal  number  of 
elements,  and  that  the  ancestry  relation  A  establishes  a  one-to-one 
correspondence  between  arcs,  sources  and  sinks,  viz., 

$(s, t) \in A \iff s = A(t) \iff t = A^{-1}(s)$    (2.4)

This  one-to-one  correspondence  will  permit  us  to  identify  in  the  sequel  each 
arc  with  its  associated  source  and  sink,  and  to  eliminate  almost  all  sinks 
and  sources  from  the  description  of  network  architectures. 
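The two no-dangling assumptions and the resulting one-to-one correspondence
(2.4) can be checked mechanically. The following is a minimal sketch under
the assumption that sources and sinks are hashable labels and arcs are
(source, sink) pairs; the function names are hypothetical.

```python
def is_valid_ancestry(sources, sinks, arcs):
    """Check assumptions (1) and (2): each source and sink lies on exactly one arc."""
    source_count = {s: 0 for s in sources}
    sink_count = {t: 0 for t in sinks}
    for s, t in arcs:
        if s not in source_count or t not in sink_count:
            return False  # arc refers to an unknown source or sink
        source_count[s] += 1
        sink_count[t] += 1
    return (all(c == 1 for c in source_count.values())
            and all(c == 1 for c in sink_count.values()))

def ancestry_maps(arcs):
    """Return the maps of (2.4): s = A(t) and t = A^{-1}(s)."""
    A = {t: s for s, t in arcs}
    A_inv = {s: t for s, t in arcs}
    return A, A_inv
```

When the check succeeds, the three sets have equal size and each arc can be
identified with its source or its sink.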

The processing relation P specifies the processing of information
extracted from sinks into transformed information, which is re-injected into
sources. The elements of P, which are called processors, are ordered
pairs of non-empty finite sink-source sequences, viz.,

$p \in P \iff p = \{t_1, t_2, \ldots, t_m;\ s_1, s_2, \ldots, s_n\}$    (2.5)

$t_i \in T,\quad s_j \in S,\quad 1 \le m, n < \infty$

The input set $(t_1, t_2, \ldots, t_m)$ consists of all the sinks from which
the processor p extracts information. The transformed information is
distributed among the members of the output set $(s_1, s_2, \ldots, s_n)$.
The one-to-one correspondence between sources, sinks and arcs allows us to
describe processor inputs and outputs in terms of arcs and to almost
completely eliminate the notion of sources and sinks. The set of input arcs
of a processor p is denoted by $A_i(p)$, and the set of output arcs from
the same processor is denoted by $A_o(p)$. Each processor is assumed to have
unique inputs and outputs, namely

$p, q \in P,\ p \ne q \implies \begin{cases} A_i(p) \cap A_i(q) = \emptyset \\ A_o(p) \cap A_o(q) = \emptyset \end{cases}$    (2.6)

Similarly, every collection of processors, $Q \subset P$, has its uniquely
defined inputs and outputs, viz.,

$A_i(Q) := \bigcup_{p \in Q} A_i(p) - \bigcup_{p \in Q} A_o(p)$    (2.7a)

and

$A_o(Q) := \bigcup_{p \in Q} A_o(p) - \bigcup_{p \in Q} A_i(p)$    (2.7b)

In other words, the inputs of Q are those inputs of processors in Q that
are not connected to outputs of processors in Q. A similar statement holds
for outputs of Q. In particular, $A_i(P)$, $A_o(P)$ are the inputs and
outputs of the entire network.
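The definitions (2.7a) and (2.7b) translate directly into set operations.
The sketch below assumes, purely for illustration, that each processor is
stored as a pair (input arcs, output arcs) in a dictionary keyed by a
processor name; these names and the data layout are not from the report.

```python
def inputs_of(Q, procs):
    """A_i(Q): input arcs of processors in Q not fed by outputs of Q, as in (2.7a)."""
    all_inputs = set().union(*(procs[p][0] for p in Q))
    all_outputs = set().union(*(procs[p][1] for p in Q))
    return all_inputs - all_outputs

def outputs_of(Q, procs):
    """A_o(Q): output arcs of processors in Q not consumed inside Q, as in (2.7b)."""
    all_inputs = set().union(*(procs[p][0] for p in Q))
    all_outputs = set().union(*(procs[p][1] for p in Q))
    return all_outputs - all_inputs
```

For a two-processor chain in which arc b connects p1 to p2, the internal arc
b disappears from both the inputs and the outputs of Q = {p1, p2}.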

Network architectures are most conveniently described by a directed
graph that combines the ancestry relation A and the processing
relation P into a single block-diagram-like representation (Figure 2-2a).
Sources and sinks are denoted by semi-circles, processors by circles and
arcs are, obviously, denoted by arcs. Sources and sinks are paired, and
each processor has its inputs and outputs adjacent to itself. An obvious
reduction in notation (Figure 2-2b) enhances the comprehensibility of the
description. The reduced form is, essentially, a block-diagram
representation of the network architecture, and can be interpreted as a
directed graph

$\mathcal{G} = (V, A)$    (2.8a)


The set of vertices V of this graph is

$V = \{A_i(P),\ P,\ A_o(P)\}$    (2.8b)

where $A_i(P)$ are interpreted as the sources corresponding to the input arcs
and $A_o(P)$ are interpreted as the sinks corresponding to the output arcs.
The arcs of the directed graph coincide with the original set of arcs A.

The interpretation of network architectures as directed graphs puts at our
disposal the powerful tools and results of graph theory. Some of these will
be used in the sequel to characterize and analyze the structure of modular
computing networks.

The functional attributes of an MCN are completely determined by its
architecture and by specifying the functional attributes of each processor.
Thus, the function of a network is an ordered pair

$\mathcal{F} = \{X, F\}$    (2.9)

where X, F are sets whose elements are called variables and maps,
respectively.

The elements of X are sets (i.e., domains) and 'assigning a value to
a variable' amounts to choosing a particular element in the domain
corresponding to that variable. There is one variable, $x_a$, associated
with every arc $a \in A$ of the corresponding architecture. Consequently,
there is a one-to-one correspondence between variables, sources, sinks and
arcs. This correspondence makes it possible to refer to the variables
associated with the inputs of a given processor p as the input variables
of p and denote them by $X_i(p)$. A similar notation, $X_o(p)$, is used for
the variables associated with the outputs of the processor p.

The elements of F are multivariable maps. There is one map, $f_p$,
associated with every processor $p \in P$ of the corresponding architecture.
It maps the input variables of this processor into the corresponding output
variables, viz.,

$f_p : X_i(p) \to X_o(p)$    (2.10)

which means that each of the output variables is a function of the input
variables (not necessarily of all the input variables). This establishes a
precedence relation between the inputs and outputs of a given processor,
viz.,


$x \to y$    (2.11)

if $x \in A_i(p)$, $y \in A_o(p)$ and if y is a function of x (and, possibly,
of other input variables). The transitive closure of this relation is also
a precedence (i.e., a partial order): We shall say that $x_1$ precedes $x_n$
if there exists a sequence of variables such that

$x_1 \to x_2 \to \cdots \to x_n$

in the sense of (2.11). This global precedence will also be denoted by
$x_1 \to x_n$. The ancestry [14] of a variable $x \in X$ is the set of all
variables that precede x, viz.,

$\alpha(x) := \{z;\ z \in X,\ z \to x\}$    (2.12)

These are all the variables that have to be known in order to determine the
value of x.
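The ancestry (2.12) is the set of variables reachable backwards through the
local precedence (2.11). The following minimal sketch assumes the local
precedence is stored as a dictionary mapping each variable to the set of
variables that directly precede it; this representation and the function
name are illustrative assumptions.

```python
def ancestry(x, direct_pred):
    """Return alpha(x): every variable z with z -> x in the transitive closure."""
    seen = set()
    stack = list(direct_pred.get(x, ()))
    while stack:
        z = stack.pop()
        if z not in seen:
            seen.add(z)
            # Ancestors of a direct predecessor are also ancestors of x.
            stack.extend(direct_pred.get(z, ()))
    return seen
```

For example, if y depends on a and b, and w depends on y and c, then the
ancestry of w is {a, b, c, y}, while a network input such as a has an empty
ancestry.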

Since the function of a network consisting of a single processor p is

$\mathcal{F}_p = \{X_i(p) \cup X_o(p),\ f_p\}$

there is, essentially, no distinction between the function and the map of
p. Thus, the input-output map of a single processor may also be called the
function of the processor. The same is not true for a network consisting of
several processors: The input-output map of a network is a single
multivariable map, relating the outputs of the network to its inputs; the
function of the network, in contradistinction, is the collection of the
atomic maps that comprise the network. The analysis problem for
computational networks is to determine the network map from its function.

The synthesis problem is to design an MCN (i.e., specify its structure and
function) that realizes a given multivariable input-output map.
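The analysis problem can be illustrated for a small finite network: given
the atomic maps, the network map is obtained by firing each processor once
its inputs are known. This is a minimal sketch under assumed conventions
(each processor is a triple of input names, output names and a function
returning a tuple); it is an illustration, not the report's procedure.

```python
def network_map(processors, inputs):
    """Evaluate all variables of a finite network from its global inputs.

    processors: list of (input_names, output_names, f), where f returns a tuple
    with one value per output name.
    """
    values = dict(inputs)
    pending = list(processors)
    while pending:
        ready, waiting = [], []
        for proc in pending:
            target = ready if all(v in values for v in proc[0]) else waiting
            target.append(proc)
        if not ready:
            raise ValueError("no processor can fire: missing inputs or a cycle")
        for in_names, out_names, f in ready:
            results = f(*(values[v] for v in in_names))
            values.update(zip(out_names, results))
        pending = waiting
    return values
```

With one adder y = a + b followed by one multiplier w = y * c, the network
map takes (a, b, c) to w = (a + b) * c, although neither atomic map has that
form by itself.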

Modular computing networks need not be finite. In fact, most signal
processing algorithms correspond to infinite MCNs. However, the concept of
finite effort, involved in the evaluation of variables, imposes certain
constraints upon infinite networks. First, the number of inputs and outputs
of every processor must be finite. This means that the graph $\mathcal{G}$
describing the architecture is locally finite [15]. Next, every variable
must be computable with finite effort, so it will be required to have a
finite ancestry, viz.,

$|\alpha(x)| < \infty$ for all $x \in X$    (2.13)

We shall also assume that the number of connected components of the
architecture $\mathcal{G}$ is countable. A modular computing network that
satisfies the three assumptions stated above (local finiteness, finite
ancestry and a countable number of connected components) will be called
structurally finite. The following result characterizes the kind of
infinity allowed in such networks.


Theorem  2.1 

A  structurally  finite  MCN  has  a  countable  number  of  variables  and 
processors.  The  following  three  statements  are  equivalent: 

(1)  The  number  of  variables  is  finite. 


(2)  The number of output variables is finite.

(3)  The number of processors is finite.


Proof:

The countability of the variables and processors of a connected network
is a direct consequence of local finiteness (see, e.g., [15]). Since each
connected component has a countable number of variables and processors, the
same is obviously true for a countable number of connected components. Thus
the number of variables and processors of a structurally finite MCN must be
countable. As a consequence of local finiteness, a finite number of
processors implies a finite number of variables and vice versa, so (1) and
(3) are equivalent. Clearly (1) implies (2), while (2), via the finite
ancestry condition, implies (1).


2.3  CAUSALITY  AND  EXECUTIONS 


The definition of processors in the previous section did not take into
account any constraints imposed by hardware implementation considerations.
The most important among these constraints is the causality property. It
will be henceforth assumed that an output of a processor cannot become
available before the inputs of the same processor that precede this output
become available. In the beginning all variables are unavailable; the
inputs of the network are made available at a given instant and, following
that event, all variables of the network gradually become available. This
temporally ordered process, which we shall call execution, must be
consistent with the precedence relation between variables induced by the
directed nature of the architecture $\mathcal{G}$. A network that possesses
an execution in which every variable ultimately becomes available is said to
be executable (or 'live' in the terminology of Petri-nets [1]). It is clear
that a network containing a cycle cannot be executable, since no variable
(= arc) on the cycle can ever become available. In order to satisfy the
causality assumption every variable in the cycle must temporally precede
itself (i.e., it must be available before it becomes available (Fig. 2-3)),
which is, clearly, impossible. It turns out that every acyclic architecture
is executable. To prove this result we shall need to formalize the notion
of execution.


Figure 2-3.


An execution of an MCN is a partitioning of its variables into a
sequence of finite disjoint sets, viz.,

$E = \{S_i;\ 0 \le i < \infty,\ |S_i| < \infty,\ S_i \cap S_j = \emptyset \text{ for } i \ne j,\ \bigcup_i S_i = X\}$    (2.14a)

such that the precedence relation is preserved, viz.,

$\alpha(S_i) \subset \bigcup_{j=0}^{i-1} S_j, \quad i = 0, 1, \ldots$    (2.14b)

Here $\alpha(S)$ denotes the ancestry of the set S, defined as the collection
of all ancestors of members of S, viz.,

$\alpha(S) := \bigcup_{x \in S} \alpha(x)$    (2.15)

In simple words, every ancestor of $x \in S_i$ must be contained in one of
the sets $S_0, S_1, \ldots, S_{i-1}$, which we shall call levels. Executions
can be interpreted as multistep procedures for evaluating all the variables
in X. The members of the level $S_i$ are evaluated at the i-th step, and the
condition (2.14b) guarantees the availability of all their ancestors at the
right moment. Since the ancestors of the level $S_i$ strictly precede $S_i$,
all variables in this set can be evaluated simultaneously, giving rise to a
parallel execution. If each set $S_i$ contains exactly one variable, the
execution will be called sequential.
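One way to construct a parallel execution of a finite acyclic network is to
place each variable in the earliest level allowed by (2.14b). The sketch
below assumes the local precedence is given as a dictionary from each
variable to its direct predecessors and that the network is finite and
acyclic; the function names are hypothetical.

```python
def earliest_levels(direct_pred):
    """Partition variables into levels S_0, S_1, ... (finite acyclic input assumed)."""
    variables = set(direct_pred) | {z for ps in direct_pred.values() for z in ps}
    depth = {}

    def level_of(x):
        if x not in depth:
            preds = direct_pred.get(x, ())
            # Network inputs (no predecessors) land in level 0.
            depth[x] = 1 + max((level_of(z) for z in preds), default=-1)
        return depth[x]

    for x in variables:
        level_of(x)
    levels = [set() for _ in range(max(depth.values(), default=-1) + 1)]
    for x, i in depth.items():
        levels[i].add(x)
    return levels

def preserves_precedence(levels, direct_pred):
    """Check (2.14b); direct predecessors in earlier levels imply the same for ancestors."""
    earlier = set()
    for level in levels:
        if any(not set(direct_pred.get(x, ())) <= earlier for x in level):
            return False
        earlier |= set(level)
    return True
```

Splitting any level into singletons, in arbitrary order, turns such a
parallel execution into a sequential one.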

Since each level $S_i$ in an execution is finite, the evaluation of the
variables in $S_i$ from the members of the preceding levels requires finite
effort. Since each variable belongs to some level $S_i$, the total effort
involved in the evaluation of a single variable from the global inputs is
also finite. Thus, the existence of an execution for a given MCN implies
that every variable can be evaluated with finite time and hardware. A
network that has an execution deserves, therefore, to be called executable.

The preceding discussion implies that executability is a structural
property, since only the precedence relation between variables is involved
in constructing executions. The following result presents a simple
structural test for executability of MCNs.


Theorem  2.2 

A  structurally  finite  MCN  is  executable  if,  and  only  if,  its 
architecture  is  acyclic. 


Proof:

If an execution exists, then it can be easily converted into a
sequential execution by ordering the variables in each (finite) level $S_i$
in some arbitrary manner. Thus, executability is equivalent to the
existence of a sequential execution. By a well-known result in the theory
of finite directed graphs, a sequential ordering exists if, and only if, the
graph is acyclic. Thus, the theorem holds for finite MCNs. The proof for
infinite networks is given in Appendix A.


Corollary 2.2

Executable MCNs always have sequential executions.
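For finite networks, the structural test of Theorem 2.2 can be sketched with
a standard topological sort (Kahn's algorithm, a technique chosen here for
illustration and not named in the report). The function returns a sequential
execution when the precedence graph is acyclic and None when it contains a
cycle; the data layout is an assumption.

```python
from collections import deque

def sequential_execution(direct_pred):
    """Return a sequential ordering of all variables, or None if a cycle exists."""
    variables = set(direct_pred) | {z for ps in direct_pred.values() for z in ps}
    remaining = {x: len(set(direct_pred.get(x, ()))) for x in variables}
    successors = {x: [] for x in variables}
    for x, preds in direct_pred.items():
        for z in set(preds):
            successors[z].append(x)
    # Start from the variables with no predecessors (the network inputs).
    queue = deque(sorted(x for x in variables if remaining[x] == 0))
    order = []
    while queue:
        z = queue.popleft()
        order.append(z)
        for x in successors[z]:
            remaining[x] -= 1
            if remaining[x] == 0:
                queue.append(x)
    # Variables on a cycle never reach zero remaining predecessors.
    return order if len(order) == len(variables) else None
```

A None result corresponds to a non-executable network: the variables on the
cycle would each have to be available before they become available.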


The corollary confirms the intuitive notion of executability: Any
computation that can be carried out in parallel can also be carried out
sequentially. Parallel execution offers, however, an attractive trade-off
between hardware and time, which will be discussed in detail in Sec. 3.4.

Theorem 2.2 provides a simple test for executability and, in effect,
prevents the construction of non-executable MCNs. Thus, the pitfalls of
starvation and deadlocks, well known in the context of Petri-nets [1], are
easy to avoid. Notice also that since each variable in an MCN is evaluated
exactly once, safeness [1] is guaranteed. This means that inputs to
processors do not disappear before they have been used to evaluate the
subsequent outputs. Safeness is achieved because once a variable becomes
available it stays so forever, and never disappears.


2.4  HIERARCHICAL  COMPOSITION  OF  MCNs 

Modular  computing  networks  are,  by  definition,  constructed  in  a 
hierarchical  manner.  A  processor  p  in  an  MCN  can  itself  be  a  network, 
provided it has a well-defined input-output map $f_p$. In this section we
analyze  the  constraints  that  have  to  be  imposed  upon  MCN  composition  in 
order  to  guarantee  the  existence  of  a  well-defined  global  input-output  map. 

From  the  structural  point  of  view  a  composition  is  simply  a  network  of 
networks.  The  'processors'  of  the  composite  network  are  MCNs  and  the  arcs 
represent interconnections from outputs of MCNs to inputs of other MCNs.
The  architecture  of  the  composition,  obtained  by  regarding  each  MCN 
component  as  a  simple  'processor'  has  to  satisfy  the  constraints  of  Sec. 
2.2.  An  architecture  is  called  admissible  if  it  satisfies  the  three 
following  constraints: 

(1)  No  dangling  inputs  and  outputs 

(2)  No  cycles 

(3)  It is structurally finite

The importance of these constraints lies in the fact that an admissible
composition of admissible architectures is itself an admissible architecture
(see Appendix B for proof). It is interesting to notice that the
admissibility conditions are also instrumental in establishing other
important properties of architectures. In particular, an admissible
composition of self-timed elements is itself a self-timed element [6], [7].

To establish the hierarchical nature of composition it is only
necessary to verify that an admissible composition of processors with a
well-defined input-output map also has a well-defined input-output map.
This will be done by interpreting executions as decompositions of MCNs into
elementary parallel and sequential combinations.

Parallel composition of two architectures, $\mathcal{G}_1$ and
$\mathcal{G}_2$, is defined as the union of the two networks without any
interconnections between $\mathcal{G}_1$ and $\mathcal{G}_2$ (Fig. 2-4a).
Sequential composition involves the connection of every output of
$\mathcal{G}_1$ to a corresponding input of $\mathcal{G}_2$; thus the number
of outputs of $\mathcal{G}_1$ must equal the number of inputs of
$\mathcal{G}_2$ (Fig. 2-4b). We shall denote parallel composition by
$\mathcal{G}_1 \# \mathcal{G}_2$ and sequential composition by
$\mathcal{G}_1 * \mathcal{G}_2$. The parallel composition of a countable
number of admissible networks is always admissible. The sequential
composition of a sequence of admissible networks is admissible too, i.e.,

$$\mathcal{G}_1 * \mathcal{G}_2 * \cdots$$

is admissible because the unilateral nature of the cascade preserves the
finite ancestry property, while local finiteness and countability of
components are clearly preserved.

Figure 2-4. Fundamental Architecture Compositions: (a) Parallel
Composition; (b) Sequential Composition.

Executions define a rearrangement of MCNs as a sequential composition
of subnetworks, each subnetwork being a parallel composition of processors.
The MCN of Figure 2-1 can, for instance, be described as

$$(f_1 \# e \# e) * (e \# f_2 \# e) * (f_4 \# e \# f_3) * (e \# f_5 \# e) * (f_6 \# e \# e)$$

where e is an identity input-output map. The importance of this
observation lies in the fact that the input-output map of any
sequential-parallel composition is well-defined. Consequently, every
execution has a well-defined input-output map. This leads to the following
result.
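
The two elementary compositions can be illustrated with a short sketch.
It is not taken from the report: the class Block, the functions parallel
and sequential, and the processors add and double are hypothetical names
chosen for illustration, and every map is assumed to take a fixed number of
inputs and return a tuple of outputs.

```python
from typing import Callable, NamedTuple, Tuple


class Block(NamedTuple):
    n_in: int                  # number of inputs
    n_out: int                 # number of outputs
    fn: Callable[..., Tuple]   # maps n_in values to a tuple of n_out values


def parallel(a: Block, b: Block) -> Block:
    """Parallel composition: side by side, with no interconnections."""
    def fn(*xs):
        return tuple(a.fn(*xs[:a.n_in])) + tuple(b.fn(*xs[a.n_in:]))
    return Block(a.n_in + b.n_in, a.n_out + b.n_out, fn)


def sequential(a: Block, b: Block) -> Block:
    """Sequential composition: every output of a feeds one input of b."""
    if a.n_out != b.n_in:
        raise ValueError("outputs of the first block must match inputs of the second")
    return Block(a.n_in, b.n_out, lambda *xs: tuple(b.fn(*a.fn(*xs))))


e = Block(1, 1, lambda x: (x,))             # identity input-output map
add = Block(2, 1, lambda x, y: (x + y,))    # hypothetical two-input processor
double = Block(1, 1, lambda x: (2 * x,))    # hypothetical one-input processor

# Two layers in cascade, each layer a parallel composition.
net = sequential(parallel(add, e), parallel(double, e))
result = net.fn(1, 2, 3)
```

Because each combinator returns a Block with a definite number of inputs
and outputs, any expression built from them again has a well-defined
input-output map, which is the property the text relies on.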


Theorem  2.3 

Every  executable  MCN  has  a  unique  well-defined  input-output  map. 


Proof : 


See  Appendix  C. 


The  theorem  establishes  the  utility  of  the  notion  of  execution.  While 
each  execution  corresponds  to  a  different  ordering  of  the  computations 
required  to  evaluate  the  output  variables  of  an  MCN,  all  executions 
determine  the  same  input-output  map.  And,  while  each  execution  provides  a 
different  description  of  the  network,  they  all  correspond  to  the  same  MCN. 

Descriptions of computational schemes will be considered equivalent if
they determine the same input-output map. They will be considered
structurally equivalent if, in addition, they determine the same MCN.
Structural equivalence, which amounts to different choices of executions,
leaves both the architecture and the function of the MCN unchanged. Other
types of equivalence transformations will affect both the architecture and
the function of the MCN but will keep its input-output map unchanged.


2.5  COMPARISON  OF  MCNs  WITH  OTHER  NETWORK  MODELS


2.5.1  Block-Diagrams  and  Finite-State  Machines

Numerical algorithms are most frequently described in terms of
recursion equations involving indexed quantities, known as signals.
Z-transform notation and block diagrams (or signal-flow-graphs) are
sometimes used as equivalent descriptions of recursion equations.

The main difference between MCNs and Z-transform block-diagrams is in
the distinguished role of time in the latter model. A cascade connection of
three blocks, each with its own state (Fig. 2-5a), corresponds to an MCN of
infinite length (Fig. 2-5b). Each row of the MCN represents a single step
of the recursion. Each input/output is a single variable, not a
time-series. While the MCN description seems wasteful, it does in fact
enhance our understanding of the various possibilities of implementation.
Moreover, MCNs can describe irregular algorithms that cannot be described in
terms of recurrence equations. This means that every block diagram can be
converted into an MCN but not vice versa. The conversion amounts to
duplicating the block diagram several times (once for every iteration of the
recursion) and converting delay elements into direct connections between
consecutive duplicates, as in Figure 2-5.

The preceding discussion considered only block-diagrams that correspond
to sets of recursion equations. Such diagrams always consist of delay
elements and memoryless operations. This means, of course, that only
block-diagrams whose blocks represent finite-state machines can be converted
in a straightforward manner into an MCN. Any other block-diagram has to be
first converted into a state-space form (i.e., every block has to be
represented by a state-space model or a combination of such models) before
it can be converted into an MCN. Thus, in particular, any signal-flow-graph
with rational transfer functions can be transformed into an MCN.

The correspondence between block-diagrams and MCNs provides a simple
test for the executability (= computability) of algorithms represented by
block diagrams.

Executability Test

A finite block-diagram (or signal-flow-graph) whose blocks are
characterized by delay elements and memoryless maps is executable if, and
only if, the directed graph obtained by deleting delay elements from the
diagram (or equivalently, by setting $z^{-1} = 0$ in the transfer functions)
is acyclic.


Proof : 


Since delay elements are causal, they can never give rise to cycles in
the corresponding MCN. In other words, since all operations in the i-th
iteration temporally precede all operations in the (i+1)-th iteration, the
only cycles the MCN representation of a block-diagram may have must be
contained within a single layer, corresponding to a single iteration. A
single layer of the MCN is obtained by removing all delay elements from the
block-diagram.


The test not only establishes the executability of a given
block-diagram but indicates how to transform non-executable networks into
executable ones. Consider, for instance, the network in Figure 2-6a. It is
non-executable if $H(\infty) \neq 0$, because a cycle exists in the network
for $z = \infty$. However, the same transfer function can be realized by the
network in Figure 2-6b, which is executable.
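
The executability test lends itself to a direct sketch. The code below is
an illustration under stated assumptions, not part of the report: the
block-diagram is given as a list of (source, target, delays) triples, where
delays counts the delay elements on that connection, and the name
is_executable and the two example diagrams are hypothetical.

```python
def is_executable(edges):
    """Delete every connection that carries a delay, then test for cycles."""
    graph = {}
    for src, dst, delays in edges:
        graph.setdefault(src, [])
        graph.setdefault(dst, [])
        if delays == 0:              # only delay-free connections remain
            graph[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2     # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}

    def has_cycle(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True          # back edge: a delay-free cycle
            if colour[nxt] == WHITE and has_cycle(nxt):
                return True
        colour[node] = BLACK
        return False

    return not any(colour[n] == WHITE and has_cycle(n) for n in graph)


# A feedback loop closed through one delay element.
with_delay = [("adder", "gain", 0), ("gain", "adder", 1)]
# The same loop with the delay removed.
without_delay = [("adder", "gain", 0), ("gain", "adder", 0)]

ok = is_executable(with_delay)
bad = is_executable(without_delay)
```

The recursive search is adequate for small diagrams; a large diagram would
call for an iterative traversal to avoid deep recursion.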


Figure 2-6. Transformation of a Non-Executable Network into an Equivalent
Executable One: (a) Non-Executable Network; (b) Equivalent Executable
Network.


2.5.2  Data-Flow-Graphs  and  Petri-Nets

The MCN is, clearly, a data-flow-graph [18] with the additional
constraint that only one token is placed at every input of the network, and
consequently, only one token eventually appears at every output of every
processor. Thus, an MCN is safe by definition. In spite of this
observation, every data-flow-graph (safe or unsafe) can be converted into an
MCN, as long as every firing of a vertex in the flow-graph removes one token
from every input line and adds one token to every output line. This
constraint implies that the data-flow-graph can be converted into a block
diagram involving only delay elements, advance elements and memoryless maps.
This block-diagram can in turn be converted into a (not necessarily
executable) MCN. The executability condition, when transformed back to the
data-flow-graph domain, becomes a cycle sum test, as described in [19].

Petri-nets  are  more  general  than  data-flow-graphs.  They  allow  two 
different  kinds  of  vertices,  known  as  places  and  conditions.  Conditions 
correspond  to  our  concept  of  processors,  while  places  are  combinations  of 
multiple  sources  and  sinks  and  thus  have  no  counterpart  in  the  MCN  model. 
Petri-nets  whose  places  have  at  most  one  input  and  at  most  one  output  are, 
in  fact,  data-flow-graphs  (also  known  as  marked  graphs  [20])  and  can  be 
converted  into  MCNs. 


2.5.3  High-Level  Programming  Languages 

Most high-level-language computer programs can be converted with little
difficulty into MCNs. Each assignment statement of the program becomes a
processor in the corresponding MCN. Program variables are mapped into
network variables according to the following rules:

(i) Each program variable, say x, is mapped into several network
variables, denoted by $x_1$, $x_2$, etc.

(ii) An occurrence of a program variable x in the right-hand-side
of an assignment statement is mapped into the same network
variable $x_i$ as the preceding occurrence of the same variable
in the program.

(iii) An occurrence of a program variable x in the left-hand-side of
an assignment statement is mapped into a new network variable,
i.e., into $x_{i+1}$ if the most recent occurrence was mapped into
$x_i$.

Recursions (do-loops) are mapped into sequential compositions of identical
processors, each processor corresponding to one step of the recursion. The
mapping of conditional recursions ('if' and 'while' statements) is somewhat
more complicated and will not be described here. A separate technical memo
will be devoted to the details of converting computer programs and other
descriptions into MCNs, and vice versa.
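
Rules (i) to (iii) can be sketched for straight-line programs as follows.
This is a minimal illustration, not the conversion procedure promised for
the separate memo: the function rename and the tuple encoding of statements
are hypothetical, loops and conditionals are not handled, and every
variable (including temporaries) receives an index, so a temporary such as
YTEM appears as YTEM1.

```python
def rename(inputs, program):
    """Map program variables to network variables for straight-line code.

    inputs  : names of the program inputs (they become x1, y1, ...)
    program : list of (target, operator, left, right) assignments
    """
    version = {name: 1 for name in inputs}
    network = []
    for target, op, left, right in program:
        # Rule (ii): a right-hand-side occurrence reuses the network
        # variable of the preceding occurrence.
        args = [f"{name}{version[name]}" for name in (left, right)]
        # Rule (iii): a left-hand-side occurrence creates a new network
        # variable with the next index.
        version[target] = version.get(target, 0) + 1
        network.append(f"{target}{version[target]} = {args[0]} {op} {args[1]}")
    return network


# Six assignments on four inputs, used here as sample data.
program = [
    ("YTEM", "*", "X", "Y"),
    ("X", "+", "X", "Y"),
    ("ZTEM", "*", "Z", "W"),
    ("W", "+", "Z", "W"),
    ("Y", "+", "YTEM", "ZTEM"),
    ("Z", "-", "YTEM", "ZTEM"),
]
network = rename(["X", "Y", "Z", "W"], program)
for line in network:
    print(line)
```

After renaming, every network variable is assigned exactly once, so the
statements may be listed in any order without changing their meaning.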

The conversion of an MCN into a computer program is straightforward:
Each processor is mapped into several assignment statements, and each
network variable is mapped into a program variable. As an example consider
a simple computer program (Table 2-1) written as a subroutine to emphasize
the role of inputs and outputs. The corresponding MCN is given by the same
table, and is described graphically in Figure 2-7. Notice that the order of
assignment statements of an MCN is inconsequential: any arrangement of
these statements conveys exactly the same information. Also notice that we
have the option of aggregating several statements with the same inputs into
one processor in order to enhance the comprehensibility of the
representation. The MCN representation of Table 2-1 is, in fact, the
'computer program' equivalent of Figure 2-7, so that no translation is
required once such a 'formal language' representation is available.
Translation of MCNs into computer programs usually results in poor
utilization of computer resources. This inefficiency can, however, be
easily handled at the compiler level. On the other hand, the
comprehensibility of MCN representations is much better than that of the
average computer program.

2.5.4  Summary 

The preceding analysis has shown that MCNs are essentially equivalent
to computer programs, to block diagrams involving finite-state-blocks, and
to a subclass of Petri-nets (marked graphs). The major distinction between
MCNs and most other representations is the embedding of the notion of
executability into the MCN model itself. Thus, the only way to design
non-executable MCNs is by the introduction of cycles in the network
architecture. Moreover, the test for executability is very easy to carry
out and can be included in any compiler for MCN representations. It is much
easier, on the other hand, to design malfunctioning Petri-nets or computer
programs, and much more difficult to detect the errors in the design.


TABLE 2-1. CONVERSION OF COMPUTER PROGRAMS INTO MCN, AND VICE VERSA

COMPUTER PROGRAM                  MCN REPRESENTATION

SUBROUTINE EXAMPLE (X,Y,Z,W)      MCN EXAMPLE (X1,Y1,Z1,W1;X2,Y2,Z2,W2)
YTEM = X*Y                        1)  YTEM = X1*Y1
X = X + Y                             X2 = X1 + Y1
ZTEM = Z*W                        2)  ZTEM = Z1*W1
W = Z + W                             W2 = Z1 + W1
Y = YTEM + ZTEM                   3)  Y2 = YTEM + ZTEM
Z = YTEM - ZTEM                       Z2 = YTEM - ZTEM
RETURN                            END
END


Figure 2-7. The MCN 'EXAMPLE' Corresponding to Table 2-1.


SECTION  3 

STRUCTURAL  PROPERTIES  OF  MCNs 


The notion of execution, defined in the previous section, provides
several quantitative characterizations of the MCN architecture. In
particular, it can be used to number the processors of an MCN and to
introduce concepts of dimensionality. A refinement of the notion of
execution leads to time schedules and to the formulation of composition
rules for execution times. Thus, the objective of associating a unique
execution time with every output of an MCN is achieved. The third
objective, that of associating a unique measure of complexity with each MCN,
has yet to be accomplished. Currently there is no consensus even upon the
measure of complexity for a single processor, let alone for a network of
processors. Some progress has been made in characterizing complexity in
terms of 'area,' but more research is required before commonly-accepted
rules for composition of complexity can be formulated. For this reason the
topic of complexity will not be considered in the sequel.


3.1  NUMBERING  OF  VARIABLES  AND  PROCESSORS 

The concept of execution, which was defined in Section 2.3, defines a
numbering E(x) on the variables of an MCN, viz.,

$$x \in S_i \iff E(x) = i \tag{3.1}$$

Since the partitioning $\{S_i\}$ and the numbering E( ) determine each other
and convey equivalent information, we shall call the function E( ) itself
an execution. Several variables may share the same value of E( ), which
means they belong to the same level $S_i$. If each level of an execution
contains exactly one variable the execution is called sequential. The
function E(x) defines, in this case, a sequential ordering of the
variables and of the processors comprising the MCN. The numbering of
variables determined by an execution E( ) is consistent with the
precedence relation since we clearly have

$$E(x) \geq 1 + \max_y \{E(y);\ y \in a(x)\} \tag{3.2}$$

Similarly, we can define a numbering of the processors by

$$E(p) := \max_x \{E(x);\ x \in X_i(p)\} \tag{3.3}$$

The value of E(p) indicates the earliest instant at which all inputs of
the processor p become available. We can also define a precedence
relation for processors, viz.,

$$q \to p$$

if there exists a directed path from q to p. This relation, in turn,
determines the ancestry set a(p) of each processor by

$$a(p) := \{q;\ q \in P,\ q \to p\} \tag{3.4}$$

It can now be seen that an analog of (3.2) holds for the numbering of
processors, viz.,

$$E(p) \geq 1 + \max_q \{E(q);\ q \in a(p)\} \tag{3.5}$$

Since a typical MCN has fewer processors than variables, the numbering of
processors is a more convenient tool for structural analysis of an MCN.
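
The two numberings can be computed together. The sketch below is
illustrative only: it assumes that network inputs are placed at level 0
(the convention of Section 2.3 may differ), it builds the execution that
satisfies (3.2) with equality, and the names earliest_execution, p1, p2 and
p3 are hypothetical.

```python
def earliest_execution(processors, network_inputs):
    """Return (level of each variable, level of each processor).

    processors: dict name -> (list of input variables, list of output variables)
    """
    level = {x: 0 for x in network_inputs}   # assumption: inputs sit at level 0
    proc_level = {}
    pending = dict(processors)
    while pending:
        ready = [p for p, (ins, _) in pending.items()
                 if all(x in level for x in ins)]
        if not ready:
            raise ValueError("no processor is ready: the network is not executable")
        for p in ready:
            ins, outs = pending.pop(p)
            # Processor numbering: largest level among the processor's inputs.
            proc_level[p] = max((level[x] for x in ins), default=0)
            # Variable numbering: one more than the latest input.
            for x in outs:
                level[x] = proc_level[p] + 1
    return level, proc_level


# A three-processor network used as sample data.
processors = {
    "p1": (["X1", "Y1"], ["YTEM", "X2"]),
    "p2": (["Z1", "W1"], ["ZTEM", "W2"]),
    "p3": (["YTEM", "ZTEM"], ["Y2", "Z2"]),
}
level, proc_level = earliest_execution(processors, ["X1", "Y1", "Z1", "W1"])
```

Taking the maximum over the immediate inputs is sufficient here because
levels never decrease along a directed path, so no earlier ancestor can
exceed the latest parent.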


3.2  DIMENSIONALITY  AND  ORDER 

A family of sequential executions $\{E_i( )\}$ on a given MCN is called
representative if

$$q \in a(p) \iff E_i(q) < E_i(p), \quad \text{all } i \tag{3.6}$$

Notice that a representative family can never consist of a single execution
(except in the case of a purely sequential MCN) because there always exist
two processors q, p such that E(q) < E(p) even though q does not
precede p (nor does p precede q). The following result shows that
every MCN has at least one representative family.


Theorem  3.1 

The collection of all sequential executions of a given MCN is a
representative family.


Proof : 

By the definition of execution

$$q \in a(p) \implies E(q) < E(p)$$

for every execution (sequential or not). To prove the converse assume that
$\{E_i( )\}$ is the collection of all sequential executions, and that for
some processors p, q

$$E_i(q) < E_i(p), \quad \text{all } i$$

Clearly p cannot precede q, but they may be incomparable. In this case
there exists a non-sequential execution E( ) such that

$$E(p) = E(q)$$

Since every execution can be transformed into a sequential one by
arbitrarily ordering the variables in each level, it follows that E can be
converted into a sequential execution, say $E_o$, such that

$$E_o(q) > E_o(p)$$

This, however, contradicts the assumptions. Hence, p, q cannot be
incomparable and we must have $q \in a(p)$.
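
For a small network the theorem can be checked by brute force. The sketch
below is illustrative and not part of the report: it enumerates all
sequential executions of a hypothetical four-processor, diamond-shaped
network and tests condition (3.6); the function names are assumptions, and
the enumeration is exponential, so it is only suitable for tiny examples.

```python
from itertools import permutations


def sequential_executions(processors, arcs):
    """All orderings of the processors consistent with the arcs (q, p)."""
    found = []
    for perm in permutations(processors):
        position = {p: i for i, p in enumerate(perm)}
        if all(position[q] < position[p] for q, p in arcs):
            found.append(position)
    return found


def ancestry(processors, arcs):
    """anc[p] is the set of processors q with a directed path from q to p."""
    anc = {p: set() for p in processors}
    changed = True
    while changed:
        changed = False
        for q, p in arcs:
            new = {q} | anc[q]
            if not new <= anc[p]:
                anc[p] |= new
                changed = True
    return anc


def is_representative(family, processors, arcs):
    """Check that q precedes p exactly when every execution orders q before p."""
    anc = ancestry(processors, arcs)
    return all(
        (q in anc[p]) == all(E[q] < E[p] for E in family)
        for p in processors for q in processors if p != q
    )


procs = [1, 2, 3, 4]
arcs = [(1, 2), (1, 3), (2, 4), (3, 4)]    # processors 2 and 3 are incomparable
family = sequential_executions(procs, arcs)
all_ok = is_representative(family, procs, arcs)
single_ok = is_representative(family[:1], procs, arcs)
```

In this example a single execution fails the test because it imposes an
order on the incomparable pair (2, 3), which the second execution reverses.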


A representative family with the smallest number of members will be
called a basis (it need not be unique). The cardinality of bases is defined
as the dimensionality of the MCN under consideration. The members of a basis
$\{E_i( )\}$ define a coordinate basis for the network, such that the
coordinates of a processor p are $(E_1(p), E_2(p), \ldots, E_n(p))$. Notice
that the dimensionality of a network is bounded below by the dimensionality
of all its subnetworks, so adding long chains of processors to a
2-dimensional network cannot reduce the overall dimension below 2
(Figure 3-1).

Every basis of an MCN determines a unique non-sequential execution
obtained by ordering the processors according to the sum of their basis
coordinates. For the example of Figure 3-1 this execution is

(1), (2,3), (4), (5), ..., (n)

The order of a basis is defined as the number of processors in the largest
layer of the parallel execution determined by the basis. For the example
above the order is 2 since the largest layer of the parallel execution
contains 2 processors. Since an MCN may have many bases it has no unique
order. Moreover, each execution E (not necessarily associated with a basis)
has its own order, defined by

$$\mathrm{ord}(E) := \max_i \left| \{p;\ E(p) = i\} \right| \tag{3.7}$$
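
The order of an execution is simply the size of its largest layer. A
minimal sketch, assuming the execution is stored as a dictionary from
processor to layer index (the name order and the five-processor example
are hypothetical):

```python
from collections import Counter


def order(execution):
    """Size of the largest layer of an execution {processor: layer index}."""
    return max(Counter(execution.values()).values())


# Layers (1), (2,3), (4), (5): processors 2 and 3 share the second layer.
example = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4}
example_order = order(example)
```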

Executions can be implemented in hardware by mapping each layer into a
single iteration, with all the processors in the layer implemented in
parallel. The order of an execution, which is the number of processors in
the largest layer, is therefore a measure of the hardware complexity of such
an implementation.

Once we have coordinate bases at our disposal we can apply metric
arguments to the representation of an MCN. For instance, we can define
distances between processors and introduce the concept of local
communication between processors in a rigorous manner. However, more
research is required to establish the properties of metrics defined by
coordinate bases; in particular, it is not yet clear how the choice of the
coordinate basis affects the metric.

3.3  SCHEDULES,  DELAY  AND  THROUGHPUT 

The  execution  of  an  MCN  represents  only  its  precedence  relation  and 
does  not  take  into  account  the  actual  time  required  for  execution.  The 
evaluation  of  each  variable  requires  a  certain  amount  of  execution  time  when 
implemented  in  hardware.  Since  each  output  of  a  processor  may  involve  a 
different  execution  delay,  execution  times  have  to  be  specified  for  arcs  of 
the  precedence  graph  rather  than  for  the  vertices.  The  execution  time 
associated  with  a  variable  x  will  be  denoted  in  the  sequel  by  T(x).  This 
is  the  time  required  to  evaluate  x  from  its  immediate  ancestors 
(=  parents),  i.e.,  from  the  variables  that  serve  as  inputs  to  the  processor 
whose  output  is  the  variable  x. 

The incorporation of time delays into the notion of execution results
in a schedule, which is formally defined as a function $\tau(x)$ that
satisfies the constraint

$$\tau(x) \geq T(x) + \max_y \{\tau(y);\ y \in a(x)\} \tag{3.8a}$$

and is zero for the network inputs, viz.,

$$x \in X_i(P) \implies \tau(x) = 0 \tag{3.8b}$$

This constraint guarantees, in particular, that the parents of x will be
available at time $\tau(x)$. Thus, schedules are refinements of executions.
In particular, with every execution E( ) we can associate a schedule
$\tau( )$ by choosing

$$\tau(x) = \max_y \{\tau(y) + T(x);\ E(y) = E(x) - 1\} \tag{3.9}$$

Such schedules are, generally, non-minimal in the sense that some operations
have all their inputs available before their scheduled execution time, i.e.,
(3.8) holds with a strict inequality for such operations. A schedule which
satisfies (3.8) with equality for every $x \in X$ is called minimal.

Minimal  schedules  are  important  because  they  characterise  the  fastest 
executions  of  a  given  MCN.  This  property  is  made  explicit  by  the  following 
result. 


Theorem  3.2 

Every structurally finite MCN has a unique minimal schedule
$\tau_{\min}( )$. The minimal schedule satisfies

$$\tau_{\min}(x) \leq \tau(x) \tag{3.10}$$

for every $x \in X$ and for every schedule $\tau( )$.


Proof : 

Since by Theorem 2.1 a structurally finite MCN has a countable number
of variables, the result can be established by induction. Thus, let S be
a subset of X that is closed under the ancestry relation, namely for every
$x \in S$ we must have $a(x) \subset S$. Assume that S has already been
assigned a minimal schedule $\tau_{\min}( )$ and that this schedule also
satisfies (3.10).

Choose a variable y not in S and consider the augmented network
determined by $S \cup a(y)$. We need to show that $\tau_{\min}( )$ can be
extended to this augmented network and that it will satisfy both (3.8) and
(3.10). The schedule $\tau_{\min}( )$ is now extended to a(y) in the
following manner:

(i) Assign $\tau_{\min}(z) = 0$ to every $z \in a(y)$ that has no ancestors.

(ii) Identify the collection of variables for which all ancestors have
already been assigned a schedule (this set is never empty).
Assign to each one of these variables the schedule

$$\tau_{\min}(z) := T(z) + \max_w \{\tau_{\min}(w);\ w \in a(z)\}$$

For every $w \in a(z)$, either $\tau_{\min}(w) = 0$ or $w \in S$, so that
$\tau_{\min}(w) \leq \tau(w)$ for any schedule $\tau( )$. Since any schedule
$\tau( )$ has to satisfy (3.8) we obtain

$$\tau(z) \geq T(z) + \max_w \{\tau(w);\ w \in a(z)\} \geq T(z) + \max_w \{\tau_{\min}(w);\ w \in a(z)\} = \tau_{\min}(z)$$

which proves that (3.10) is preserved in this step.

(iii) Augment S, viz.,

$$S := S \cup a(y)$$

and go back to (ii).

The repeated application of this procedure results in the assignment of
$\tau_{\min}(x)$ to every variable of the MCN. The resulting schedule is
minimal, i.e., it satisfies (3.8) with an equality, unique (by construction)
and also satisfies (3.10).
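
For a finite network the construction in the proof amounts to a
longest-path computation. The sketch below is illustrative only: it assumes
non-negative execution times, so that the maximum over the immediate
ancestors equals the maximum over the whole ancestry set, and the names
minimal_schedule, parents and T as well as the sample delays are
hypothetical.

```python
def minimal_schedule(parents, T):
    """Schedule satisfying the constraint with equality at every variable.

    parents: dict variable -> list of immediate ancestors ([] for network inputs)
    T      : dict variable -> execution time of every non-input variable
    """
    tau = {}

    def visit(x):
        if x not in tau:
            if not parents[x]:
                tau[x] = 0           # network inputs are available at time zero
            else:
                tau[x] = T[x] + max(visit(y) for y in parents[x])
        return tau[x]

    for x in parents:
        visit(x)
    return tau


# Sample network: two products followed by a sum and a difference.
parents = {
    "X1": [], "Y1": [], "Z1": [], "W1": [],
    "YTEM": ["X1", "Y1"], "X2": ["X1", "Y1"],
    "ZTEM": ["Z1", "W1"], "W2": ["Z1", "W1"],
    "Y2": ["YTEM", "ZTEM"], "Z2": ["YTEM", "ZTEM"],
}
# Hypothetical delays: 3 time units per product, 1 per sum or difference.
T = {"YTEM": 3, "ZTEM": 3, "X2": 1, "W2": 1, "Y2": 1, "Z2": 1}
tau = minimal_schedule(parents, T)
```

Any other schedule must assign each variable a time at least as large as
the value computed here, which is the content of (3.10).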


As with executions, we can also define schedules for processors. The
schedule of a processor $p \in P$ is defined as

$$\tau(p) := \max_x \{\tau(x);\ x \in X_i(p)\} \tag{3.11}$$

in analogy with (3.3). It is the instant at which all input variables of the
processor become available. Some of the inputs of the processor may become
available earlier and need, therefore, storage or buffering until they can
actually be used. A variable x is called critical with respect to a given
schedule $\tau( )$ if

$$x \in X_i(p) \implies \tau(x) = \tau(p) \tag{3.12}$$

and non-critical otherwise. Thus, the schedule of each processor is
determined by the schedule of its critical inputs. Since non-critical
variables require storage the general objective of scheduling is to reduce
the total storage requirements.

Storage is measured by the product of volume (e.g., the number of bits
to be stored) and duration. The duration of storage for a variable
$x \in X_i(p)$ is the difference between the time it becomes available and
the latest instant it still needs to be available, i.e.,

$$\max_y \{\tau(y) - T(y);\ y \in X_o(p),\ x \to y\} - \tau(x)$$

This interval will be minimized if we choose the difference
$\tau(y) - T(y)$ as short as possible. In view of (3.8), we have to choose
$\tau(y) - T(y) = \tau(p)$, namely the minimal schedule also minimizes the
storage requirements of the network. The minimal schedule still has both
critical and non-critical variables. However, only the critical ones
determine the schedule, as demonstrated by the following result.


Lemma 3.3

Every processor in a structurally finite MCN is connected to a network
input by a finite path whose variables (arcs) are critical under the minimal
schedule.


Proof : 

The definition of a critical variable implies that every processor has
at least one critical input variable. The critical path is obtained by
tracing back through the critical inputs of the preceding processors. Since
the ancestry of each processor is finite, this procedure terminates in a
finite number of steps when the path reaches a network input.


Corollary 3.3

The minimal schedule of a processor equals the length (sum of
processing delays) of a critical path that connects a network input to this
processor.


The corollary implies an interesting principle for the physical design
of hardware implementations: critical paths need to be considered first so
that the length of the physical connections along the path can be minimized.
Non-critical paths can accommodate extra propagation delays and can,
therefore, be designed later.
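
Tracing a critical path is a simple backward walk. The sketch below is an
illustration rather than the report's procedure: it works on variables
instead of processors, it takes the minimal schedule as given data, and the
names critical_path, parents, T and tau together with the sample delays are
hypothetical.

```python
def critical_path(x, parents, tau):
    """Walk back from x through the latest-arriving parent at every step."""
    path = [x]
    while parents[path[-1]]:
        # The parent with the largest schedule is a critical input of the
        # processor that produces the current variable.
        path.append(max(parents[path[-1]], key=lambda y: tau[y]))
    path.reverse()
    return path


parents = {
    "X1": [], "Y1": [], "Z1": [], "W1": [],
    "YTEM": ["X1", "Y1"], "ZTEM": ["Z1", "W1"],
    "Y2": ["YTEM", "ZTEM"],
}
# Hypothetical delays and the corresponding minimal schedule.
T = {"YTEM": 3, "ZTEM": 3, "Y2": 1}
tau = {"X1": 0, "Y1": 0, "Z1": 0, "W1": 0, "YTEM": 3, "ZTEM": 3, "Y2": 4}

path = critical_path("Y2", parents, tau)
length = sum(T.get(v, 0) for v in path)   # network inputs contribute no delay
```

In this example the accumulated delay along the path equals the minimal
schedule of the final variable, in line with the corollary. When several
parents tie, the first one listed is followed, so the critical path need
not be unique.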

The construction of a schedule is based upon the assumption (3.8b) that
all MCN inputs are available at the very beginning. Thus, a zero schedule
was assumed in (3.8) for every MCN input, i.e.,

$$x \in X_i(P) \implies \tau(x) = 0$$

This is, however, inessential, since many of these inputs will not be
required until much later. The scheduling of the network inputs can be
modified, once a schedule $\tau( )$ has been determined, to reflect the
earliest instant they are required in the execution. Thus, for every
$x \in X_i(P)$ redefine the schedule of the inputs to be

$$x \in X_i(P) \implies \tau(x) := \tau(p) \quad \text{where } x \in X_i(p) \tag{3.13}$$

and no buffering, or storage, of the inputs will be necessary. This is
particularly important if not all the inputs can be made available at the
same instant, e.g., in real time processing of time-series. Notice that
this modification in the scheduling of inputs does not affect the schedule
of any other variable in the network. This is so because only non-critical
input variables are adjusted. The meaning of (3.13) is that all network
inputs are made critical to reduce the storage requirements of the network.

The schedule of output variables is commonly known as delay. The delay
of x is the time that has elapsed from the moment some variable in a(x)
becomes available until the moment the variable x itself becomes
available. This is, clearly,

$$\tau(x) - \min_y \{\tau(y);\ y \in a(x)\}$$

and in many cases it will be equal to $\tau(x)$. In typical signal processing
applications the delay of outputs usually increases without limit as more
and more inputs are applied to the processor and more and more outputs are
evaluated. In such cases one is interested in the rate of output
evaluation, commonly known as throughput, rather than in the delay of the
outputs. The throughput is roughly the number of MCN outputs that are
evaluated in a unit of time. Since this rate may vary, we need a more
rigorous definition based on the concept of schedule.

Every schedule determines a temporal ordering of the MCN variables (it
need not be sequential), which is consistent with the precedence relation
between variables. In order to quantify the rate at which output variables
are evaluated, we define the output counting function

$$N_o(t) := \text{number of elements in the set } \{y;\ y \in X_o(P),\ \tau(y) < t\} \tag{3.14}$$

The input counting function can be similarly defined, viz.,

$$N_i(t) := \text{number of elements in the set } \{y;\ y \in X_i(P),\ \tau(y) < t\} \tag{3.15}$$
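
A counting function is straightforward to evaluate once a schedule is
known. The sketch below is illustrative: it uses the strict inequality as
printed in (3.14) and (3.15), and the function name counting_function and
the sample schedule are hypothetical.

```python
from bisect import bisect_left


def counting_function(tau, variables):
    """Return N(t): the number of listed variables scheduled strictly before t."""
    times = sorted(tau[y] for y in variables)

    def N(t):
        # bisect_left counts the entries that are strictly smaller than t.
        return bisect_left(times, t)

    return N


# Hypothetical schedule of four network outputs.
tau = {"X2": 1, "W2": 1, "Y2": 4, "Z2": 4}
N_o = counting_function(tau, ["X2", "W2", "Y2", "Z2"])
samples = [N_o(t) for t in (0, 1, 2, 4, 5)]
```

Replacing bisect_left by bisect_right would count the variables available
at or before t instead, should the non-strict convention be preferred.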

We can now plot the counting function N(t) as a function of t for both
the inputs and the outputs (Figure 3-2). The functions $N_i(t)$, $N_o(t)$
are, of course, staircase functions (indicated by broken lines in Figure
3-2) and can be upper-bounded by a pair of continuous, piecewise-linear
curves (indicated by the solid lines in Figure 3-2). The slope of these
curves (which are always strictly monotone increasing) is a measure of the
rate of information flow into the network and out of it, and will be called
the input and output throughput, respectively. A schedule is called regular
when both its input and output throughput are periodic with the same period
(and, in particular, when both throughputs are constant). An MCN is called
temporally-regular when its minimal schedule is regular. Many
temporally-regular networks have equal input and output throughputs, but
this need not be true in general.


3.4  SPACE-TIME  DIAGRAMS 


The  continuous-time  character  of  the  schedule  is  best  demonstrated  by 
introducing  a  time-axis  into  the  graphical  description  of  an  MCN.  The 
vertices  are  arranged  so  that  the  vertical  displacement  from  the  top  of  the 
diagram  to  the  location  of  any  given  vertex  p  indicates,  on  an  appropriate 
scale, the value of the schedule $\tau(p)$ for this vertex (Figure 3-3, compare
with Figure 2-1). This space-time diagram has several interesting
properties:

(1)  All  arcs  point  downward. 

(2)  The  vertical  displacement  of  an  arc  indicates  the  total  execution 
time  associated  with  this  operation,  including  any  buffering  time 
that  may  be  required  beyond  the  actual  execution  time  T(x). 

(3)  Changes  in  local  execution  times  are  easily  accounted  for  by 
shifting  the  corresponding  vertices  up  or  down  along  the  time 
scale.  The  global  effects  of  such  shifts  are  clearly  depicted  by 
the  diagram. 

(4)  Non-executable MCNs (with zero or negative execution times) can
still be described by the diagram. This is useful to establish
equivalence between various descriptions of the same MCN (e.g.,
precedence graphs and signal flow graphs).

The collection of processors (vertices) with the same schedule forms an
isochrone.

The  execution  of  a  network  according  to  a  given  schedule  may  now  be 
interpreted  as  the  propagation  of  a  single  wavefront  of  activity  through  the 
architecture.  The  location  of  the  activity  wavefront  at  any  given  instant 
is  indicated  by  the  corresponding  isochrone.  Observe  that  the  isochrones 
are  parallel  straight  lines  (or  parallel  planes  if  the  precedence  graph  is 
described  in  a  three  dimensional  space)  and  do  not  intersect.  Also  notice 
that the inputs and outputs of a temporally-regular network are evenly
distributed in time (i.e., along the vertical axis of the space-time
diagram). These properties are particularly significant for the analysis of
iterative MCNs, which is carried out in Section 4.
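As a rough illustration of isochrones and of property (1) of the space-time diagram, the following Python sketch groups processors by their schedule value and checks that every arc points downward in time. The processor names, the dictionary layout and the sample graph are assumptions made for this example only.

```python
from collections import defaultdict

def isochrones(schedule):
    """Group processors by schedule value; each group is one isochrone."""
    groups = defaultdict(list)
    for processor, t in schedule.items():
        groups[t].append(processor)
    return {t: sorted(ps) for t, ps in sorted(groups.items())}

def all_arcs_point_downward(arcs, schedule):
    """True when every arc leads from an earlier vertex to a strictly later one."""
    return all(schedule[src] < schedule[dst] for src, dst in arcs)

# Hypothetical four-processor network (not one of the report's figures).
schedule = {"p1": 0, "p2": 1, "p3": 1, "p4": 2}
arcs = [("p1", "p2"), ("p1", "p3"), ("p2", "p4"), ("p3", "p4")]

wavefronts = isochrones(schedule)
downward = all_arcs_point_downward(arcs, schedule)
```

Reading `wavefronts` in increasing order of its keys gives the successive positions of the activity wavefront for this small example.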

As an illustration of the equivalence between various descriptions of
the same MCN, consider the block-diagram of an IIR filter (Figure 3-4a). The
corresponding MCN (Figure 3-4b) can be rearranged in many ways without
modifying the architecture of the network. However, if Figure 3-4b is
interpreted as a space-time diagram (with time being the vertical axis),
such modifications result in different schedules and also in different
block-diagrams. In particular, the delay elements can be moved to the lower
path (Figure 3-5) or split between the two signal paths (Figure 3-6). The
latter version is the only one that can be implemented in hardware because
it contains only downward-pointing arrows; the other two versions
require instantaneous evaluation of each variable associated with a
horizontal arrow. The third description also makes it clear that the time
interval between successive applications of inputs is equal to two delay
units. It is also possible to associate unequal computing times with the
forward and backward propagation through each block. After all, the forward
path only feeds information through the block while the backward path
involves a multiply-and-add operation. The resulting space-time diagram
(Figure 3-7) has delays $T_f$, $T_b$ associated with the forward and backward
paths, and the input interval is clearly $T_f + T_b$. Notice that the block
diagram description involves two different delay blocks: This is known as a
multirate implementation [8]. The throughput rates are, nevertheless, equal
to $(T_f + T_b)^{-1}$ for both the input and the output.

The same technique can be applied to analyze the several proposed
systolic-array-like implementations for matrix multiplication: the
hexagonal array of H.T. Kung [5], the improved hexagonal array of Weiser and
Davis [4], the wavefront array processor of S.Y. Kung [7] and the direct
form realization of S. Rao [10]. Details are provided in Appendix E.

The analysis of the previous examples makes it clear that the common
MCN architecture shared by all the representations of a given processing
system induces certain invariants. For instance, the total number of
outputs of each processor remains invariant, even though in some
representations some of these outputs are connected to a local memory rather
than to a nearby processor (Figure 3-8). The same is true for the total
number of inputs of each processor. Notice that the blocks in Figure 3-8a
are still the same as in Figure 3-4a, including the orientation of paths
(one forward, one backward). On the other hand, the roles of the blocks are
quite different; in particular, outputs are obtained from the local memories
rather than from the left-most block alone, as in Figure 3-4a.


Figure 3-6.  Schematic Description #3 of an IIR Filter


Figure 3-8.  Schematic Representation of IIR Filter Involving Local Memory


SECTION  4 


ITERATIVE  NETWORKS 


An MCN is called iterative when it can be described as a sequential
composition of identical subnetworks, i.e.,

$$ \mathcal{N} = \mathcal{F} * \mathcal{F} * \mathcal{F} * \cdots $$

Each of the identical components will be called an iteration. One
reason for this name is that the MCN can be executed by implementing a
single component $\mathcal{F}$ in hardware and simulating a sequential composition of
such components by spreading the execution of the components in time. The
motivation for studying iterative MCNs is that most signal-processing
algorithms and, in particular, all systolic-array-like architectures can be
described by such networks. Observe that every block-diagram representation
corresponds to an iterative MCN. The iterative structure induces certain
regularity constraints upon the MCN which lead to a simplified
representation.


4.1  PROPERTIES  OF  ITERATIVE  MCNs 

The minimal schedules of iterative networks are, clearly, periodic with
the same period for input and output schedules. Thus iterative MCNs are
temporally-regular. In addition, they are functionally-regular in the sense
that each iteration involves the same function. Consequently, their
properties can be determined by analyzing a single iteration. For instance,
the entire network is acyclic (hence executable) if a single iteration is
acyclic. In particular, the executability of z-transform representations of
iterative MCNs is tested by removing all separators and examining the
remaining directed graph for occurrence of cycles (see also [9]).

Similarly, the (minimal) schedule of the network can be determined by
considering a single iteration.
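The executability test described above (remove the separators, then look for cycles) can be sketched as follows. The sketch assumes that separators are the delay elements of a z-transform diagram and that the diagram is given as a list of arcs flagged as separator or not; the arc format and the function name are assumptions of this example, and the cycle check is an ordinary topological sort (Kahn's algorithm), not a procedure taken from the report.

```python
def is_executable(arcs):
    """arcs: iterable of (source, destination, is_separator) triples.

    Separator arcs are dropped; the remaining directed graph must be acyclic.
    """
    graph = {}
    for src, dst, is_separator in arcs:
        graph.setdefault(src, [])
        graph.setdefault(dst, [])
        if not is_separator:
            graph[src].append(dst)

    # Kahn's algorithm: repeatedly remove vertices that have no incoming arcs.
    indegree = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indegree[w] += 1
    ready = [v for v, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in graph[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    # Every vertex can be removed exactly when no cycle remains.
    return removed == len(graph)

# Hypothetical first-order recursion: the feedback path goes through a delay.
with_delay = [("add", "out", False), ("out", "add", True)]
# The same loop without a delay element is a delay-free cycle.
without_delay = [("add", "out", False), ("out", "add", False)]
```

For these two hypothetical diagrams the loop with a delay in the feedback path passes the test, while the delay-free loop does not.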

Iterative  MCNs  are  commonly  described  by  recursion  equations  (or 
equivalently  by  z-transform  diagrams),  data-flow  diagrams  (marked  graphs), 
or  by  'do-loops'  in  high-level  programming  languages.  While  precedence 
graphs  of  iterative  networks  still  indicate  all  possible  executions, 
recursion  equations  restrict  the  choice  of  execution  to  one  or  at  most  two 
possibilities  (Figure  4-1).  And  while  precedence  graphs  avoid  the  pitfall 
of  non-executable  iteration  by  explicitly  describing  each  iteration  as  part 
of  an  executable  (acyclic)  precedence  graph,  data-flow  diagrams  contain 
cycles  which  may  cause  the  entire  MCN  to  be  non-executable. 

Since  all  iterations  are  identical,  the  schedules  of  every  two 
consecutive iterations differ by the same constant, which we shall call the
input  interval.  The  input  interval  is  the  period  of  the  input  schedule  or, 
equivalently,  of  the  input  throughput,  as  well  as  of  the  output  schedule 
(recall  that  iterative  MCNs  are  temporally  regular).  It  determines  an  upper 
bound  on  the  rate  at  which  inputs  are  applied  to  the  network  (lower  rates 
are  permitted,  but  require  additional  buffering). 

The  time-space  diagram  of  an  iterative  MCN  corresponds  to  its  minimal 
schedule  and  is,  therefore,  periodic.  It  is  important  to  notice  that  the 
period  (=  input  interval)  is,  in  general,  shorter  than  the  time  required  to 
complete  the  execution  of  a  single  iteration  (=  the  iteration  delay).  This 
means  that  hardware  realizations  of  the  MCN  can  be  pipelined:  New  inputs 
can  be  applied  before  the  processing  of  previous  inputs  has  been  completed. 
The following section provides a detailed analysis of pipelinability in
MCNs.


4.2  HARDWARE  ARCHITECTURES 

The  functional  regularity  of  iterative  MCNs  implies  that  they  can  be 
implemented  in  special  purpose  VLSI  hardware  by  mapping  the  precedence  graph 
of  a  single  iteration  directly  into  silicon.  Each  processor  is  mapped  into 
a  cell  ('processing  element')  and  precedence  relations  are  mapped  into 
interconnections between cells. Neither translation nor hardware
compilation is required to accomplish this mapping since the hardware
architecture is an exact image of a single layer of the network
architecture.  An  execution  is  now  interpreted  as  the  propagation  of  a 
sequence  of  wavefronts  through  the  hardware  rather  than  the  propagation 
of  a  single  activity  wavefront  through  the  iterative  MCN  (Figure  4-2).  The 
time  spacing  of  these  wavefronts  equals  the  period  of  the  underlying  MCN 
minimal  schedule. 

Since a single layer of the MCN is used to 'simulate' the entire
network, each processor is activated many times and each arc of the hardware
architecture  corresponds  to  a  time-series  of  variables  rather  than  to  a 
single  variable.  This  raises  a  design  problem  of  a  new  kind:  It  is 
necessary  to  guarantee  that  variables  do  not  disappear  before  they  have  been 
used  to  evaluate  their  successors.  There  are  three  different  solutions  to 
this  problem: 

(1)  Iterative  execution:  A  new  iteration  is  initiated  only  after  the 
execution  of  the  preceding  iteration  has  been  completed.  This 
means  that  the  input  interval  is  extended  (by  buffering  of 
intermediate  results)  to  the  length  of  the  iteration  delay,  and 
the  time-overlap  between  iterations  is  completely  eliminated. 

(2)  Scheduled  execution:  The  (minimal)  schedule  of  the  network  is 
determined  in  advance  and  execution  is  carried  out  according  to 
schedule.  Buffering  is  provided  to  guarantee  the  availability  of 
inputs to processors on schedule (only non-critical variables need
to be buffered).

(3)  Self-timed  execution:  Processors  are  activated  as  soon  as  their 
inputs  become  available.  Acknowledgment  signals  ('hand-shaking') 
are used to guarantee the correct transfer of data between
processors.

While  scheduled  execution  offers  the  shortest  execution  time  and  requires  a 
fairly  simple  control  system,  it  is  extremely  sensitive  to  scheduling 
perturbations.  Such  perturbations,  which  are  caused  by  clock-skewing  and 
local  variations  in  execution  times,  may  result  in  loss  of  synchronization 
between  cells  and  a  complete  failure  of  the  system.  Iterative  execution  is 
insensitive  to  scheduling  perturbations  and  requires  a  very  simple  control 
system,  but  wastes  processing  time  since  the  hardware  is  idle  most  of  the 
time. Self-timed execution provides a nice tradeoff between these two
extremes: Its execution time is only slightly longer than the theoretical
minimum achieved by scheduled execution, and the control system it requires
has about the same complexity as the timing system for scheduled execution.
It is interesting to observe that the conditions for self-timed
execution [11], [12] coincide with the concept of admissible composition,
which was shown to be the necessary and sufficient condition for
executability in general. Thus, every MCN can be implemented as a self-
timed system.

Figure 4-2.  Execution Interpreted as Activity Wavefront Propagation
(b. Hardware Architecture Perspective)

The  notion  of  self-timed  execution  suggests  the  introduction  of  self- 
timed  block-diagrams.  These  are  obtained  by  removing  the  delay-elements 
from  a  conventional  block-diagram  and  replacing  them  by  direct  connections. 
The hardware implementation of such self-timed diagrams is straightforward
provided two simple rules are obeyed:

(i)  Each  cell  is  activated  as  soon  as  all  its  inputs  become  available 
and  deactivated  as  soon  as  all  its  outputs  have  been  evaluated. 

(ii)  Each input variable is accompanied by an acknowledgment line.
Each input port (sink) acknowledges the arrival of a new input
variable to the processor that evaluated this variable. The
acknowledgment is sent when the processor connected to the input
port becomes activated.

These  rules  assume  that  each  cell  is  provided  with  sufficient  memory  to 
store  its  output  variables  until  they  become  acknowledged. 

The  acknowledgment  of  inputs  associated  with  self-timed  implementations 
can  (and  should)  be  reflected  in  the  space-time  diagram  of  the  network. 
Acknowledgment  signals  are  just  one  more  set  of  variables  in  the  network, 
and  are  represented  in  the  space-time  diagrams  by  arcs,  as  any  other 
variable. For instance, a cascade connection of (identical) processors
(Figure 4-3a) has an input interval of $\tau_1 + \tau_2$, where $\tau_1$ is the execution
time of the processor and $\tau_2$ is the delay between the reception of an
acknowledgment signal from the subsequent processor and the transmission of
an acknowledgment signal to the preceding processor (Figure 4-3b). The
interval $\tau_2$ includes the propagation time through the processor and the
connecting  wires  plus  the  time  required  to  carry  out  checks  on  the  input 
data  (parity,  error  detection,  fault  detection,  etc.).  Notice  that  the  need 
for  explicit  acknowledgment  can  be  eliminated  in  many  cases,  e.g.,  when 
there  is  an  information  carrying  path  along  the  cascade  in  the  backward 
direction. 


The  horizontal  dimension  of  space-time  diagrams  can  now  be  interpreted 
as  hardware:  Processors  located  along  the  same  horizontal  line 
(isochrone)  represent  computations  that  need  to  be  carried  out 
simultaneously  and  must,  therefore,  be  implemented  in  parallel  hardware.  We 
shall  adopt  the  convention  of  interpreting  the  vertical  dimension  of  space- 
time  diagrams  as  pure  time:  Processors  located  along  the  same  vertical  line 
will  represent  computations  that  are  carried  out  by  the  same  processing 
element  but  during  different  (non-overlapping)  intervals  of  time.  Thus,  for 
instance,  the  MCN  of  Figure  4-4  can  be  implemented  in  hardware  with  four 
processing  elements  (Figure  4-4b).  Each  vertical  column  of  processors  in 
the  space-time  diagram  of  the  MCN  (Figure  4-4a)  is  mapped  into  a  single 
hardware  cell;  connections  between  columns  are  mapped  into  physical 
connections  between  cells  and  connections  within  columns  are  implemented  by 
locally  storing  intermediate  results  inside  the  appropriate  cells.  However, 
the  correspondence  between  space-time  diagrams  and  hardware  block  diagrams 
is  not  always  so  straightforward.  For  instance,  the  various  representations 
of  the  IIR  filter  (Figures  3-4  through  3-6)  seem  to  indicate  that  the 
hardware  implementation  of  an  m-th  order  filter  requires  m  +  1  processing 
elements. However, we also observe that only 50% of these elements are
active at any given instant of time. Thus, it is possible to cut the number
of processing elements by half without affecting the schedule at all (Figure
4-5a). In general, every two processors p, q whose schedules satisfy

$$ \tau(q) \ge \max \{\, \tau(x) \,;\ x \in X_o(p) \,\} \tag{4.1} $$

can  be  implemented  as  a  single  hardware  cell,  which  first  evaluates  the 
outputs  of  p,  then  the  outputs  of  q.  The  condition  (4.1)  guarantees 
that  the  computations  represented  by  the  processor  p  are  completed  before 
the  computations  represented  by  the  processor  q  are  started.  Since 
adjacency  of  processors  in  the  time-space  diagram  is,  in  general,  mapped 
into  spatial  adjacency  of  hardware  cells,  the  merging  of  non-adjacent 
processors  into  a  single  hardware  cell  will  require  a  complex  network  of 
interconnections  between  cells.  In  order  to  reduce  the  complexity  of  cell 
interconnections  to  a  minimum  only  adjacent  cells  can  be  considered  as 
candidates  for  merger.  The  most  frequent  example  of  such  merger  is  the 
interleaving  of  two  adjacent  columns,  as  in  Figure  4-5. 
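The merging condition (4.1), restricted to adjacent processors as suggested above, can be checked mechanically. The sketch below is only illustrative: the data layout (a start time per processor, a list of output times per processor, and a list of adjacent pairs) and all names are assumptions introduced here.

```python
def can_share_cell(tau_q, output_times_of_p):
    """Condition (4.1): q may follow p on one cell if tau(q) >= max tau(x)
    taken over the outputs x of p."""
    return tau_q >= max(output_times_of_p)

def mergeable_adjacent_pairs(start_time, output_times, adjacent_pairs):
    """Return ordered pairs (first, second) of adjacent processors that can
    be time-multiplexed on a single hardware cell."""
    result = []
    for p, q in adjacent_pairs:
        if can_share_cell(start_time[q], output_times[p]):
            result.append((p, q))   # p finishes before q starts
        elif can_share_cell(start_time[p], output_times[q]):
            result.append((q, p))   # q finishes before p starts
    return result

# Hypothetical schedule values in arbitrary time units.
start_time = {"p": 0.0, "q": 2.0, "r": 1.0}
output_times = {"p": [1.0, 2.0], "q": [3.0], "r": [2.5]}
adjacent_pairs = [("p", "q"), ("q", "r")]

pairs = mergeable_adjacent_pairs(start_time, output_times, adjacent_pairs)
```

Here p and q can share a cell because all outputs of p are available by the time q starts, whereas q and r overlap in time and must remain separate cells.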

Self-timed or scheduled execution is, indeed, faster than iterative
execution only if the input interval is shorter than the execution time of a
single iteration. An implementation of such an execution will initiate a
new iteration before the execution of the previous iteration has been
completed. Such implementations deserve to be called pipelined. Thus,
iterative executions are never pipelined, while self-timed (or scheduled)
executions are pipelined only for pipelinable MCNs.
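The gain from pipelining can be made concrete with a small calculation. The sketch below assumes an idealised model in which successive iterations start exactly one input interval apart and handshake overhead is ignored; the function names and the sample numbers are assumptions of this example, not values from the report.

```python
def iterative_execution_time(num_iterations, iteration_delay):
    """Each iteration starts only after the previous one has completed."""
    return num_iterations * iteration_delay

def pipelined_execution_time(num_iterations, iteration_delay, input_interval):
    """Iterations start one input interval apart; the last one still needs a
    full iteration delay to complete."""
    return iteration_delay + (num_iterations - 1) * input_interval

def is_pipelined(iteration_delay, input_interval):
    """A realization is pipelined when a new iteration starts before the
    previous one has finished."""
    return input_interval < iteration_delay

# Hypothetical figures: 10 iterations, iteration delay 8, input interval 2.
t_iterative = iterative_execution_time(10, 8)
t_pipelined = pipelined_execution_time(10, 8, 2)
```

When the input interval equals the iteration delay, the two formulas coincide, which matches the observation that such a network gains nothing from scheduled or self-timed execution.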

Notice that the input interval is uniquely defined for every
temporally-regular MCN, but the iteration delay (= execution time of a
single iteration) depends upon the partitioning of the MCN into iterations.
Since this partitioning need not be unique, an iterative MCN may have
several hardware realizations, each with a different iteration delay. Thus,
pipelining is primarily a property of a given hardware realization. An MCN
is considered pipelinable if it has at least one pipelined realization.
Pipelinability is most frequently associated with completely regular MCNs
(= systolic-array-like networks). The connection between complete
regularity and pipelinability is discussed in the following section.

4.3  COMPLETELY  REGULAR  MCNs 

An  iterative  MCN  is  called  completely  regular  if  it  satisfies  the  four 
following  conditions: 

(i)  All processors are identical, i.e., have the same input-output
map.

(ii)  The architecture of a single iteration is regular, i.e., it can
be represented by a regular multidimensional grid (also known as
mosaic in the 2-D case [21]).

(iii)  The architecture of a single iteration is nested, i.e., it can
be considered as a sequence of architectures of growing size,
each being obtained from the previous one by adding cells in a
regular manner (Figure 4-6).

(iv)  The  minimal  schedule  of  the  network  is  regular.  This  means  that 
adding  the  time  dimension  to  the  regular  representation  of  a 
single  iteration  produces  a  regular  space-time  diagram. 


The last condition is important, for it is quite possible to have regular
architectures with identical processors yet irregular schedules (Figure
4-7). A network of the form described in Figure 4-7 is, clearly, not
pipelinable: The input interval coincides with the iteration delay. A
completely regular network, on the other hand, has an input interval that is
much shorter than the iteration delay, and is, therefore, pipelinable.

The  nestedness  of  a  completely  regular  architecture  is  a  purely  spatial 
attribute.  The  temporal  dimension  of  the  corresponding  space-time  diagram 
is  not  affected  when  the  size  of  the  nested  architecture  varies.  In 
particular,  the  input  interval  is  fixed  and  does  not  depend  upon  the  spatial 
extent  of  the  space-time  diagram.  The  iteration  delay,  on  the  other  hand, 
increases  as  the  spatial  extent  of  the  network  increases,  since  each 
iteration  involves  more  and  more  processors.  Consequently,  the  input 
interval  is  less  than  the  iteration  delay,  i.e.,  the  MCN  is  pipelinable. 

The invariance of the input interval with order (= spatial extent of the
network) is sometimes taken as the definition of pipelinability [9].
However, this invariance is guaranteed only for completely regular MCNs: It
is possible to have pipelinable iterative networks whose input interval
grows (slowly) with order.

A  completely  regular  MCN  determines  a  coordinate  basis  for  a  single 
iteration  in  a  natural  manner.  Adding  the  temporal  dimension  to  this 
spatial  basis  produces  a  set  of  coordinates  for  the  complete  MCN  (i.e.,  for 
its  space-time  diagram).  It  is  interesting  at  this  point  to  relate  this 
basis  to  the  more  abstract  definition  provided  in  Section  3.2  for  arbitrary 
MCNs.  That  definition  was  based  upon  the  notion  of  representative  families 
of  sequential  executions,  which  are  highly  nonunique.  The  natural 
coordinate  basis  associated  with  a  completely  regular  architecture  also 
provides  a  natural  and  unique  choice  of  a  representative  family.  This  is 
accomplished  by  considering  the  lexicographic  ordering  of  processors  induced 
by the basis. The coordinates $(e_1, \ldots, e_n)$, which for a completely
regular MCN have integral values, are scanned from $e_1$ to $e_n$. This means
that the lexicographic ordering of vertices of a 2-D square grid of size 3
is: 11, 21, 31, 12, 22, 32, 13, 23, 33. Each cyclic permutation of
$(e_1, \ldots, e_n)$ determines one sequential execution, and the totality of n
sequential executions forms a representative family. The corresponding
coordinate basis is consistent with the natural basis $(e_1, \ldots, e_n)$.
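The lexicographic scan and the family obtained from cyclic permutations of the coordinates can be reproduced with a short Python sketch. The function names are assumptions introduced for illustration; coordinates are 1-based to match the labels 11, 21, ... used in the text.

```python
from itertools import product

def scan(size, order):
    """List grid points with the axes in `order` scanned fastest-first.

    `order` is a tuple of axis indices; order[0] varies fastest.
    """
    n = len(order)
    points = []
    # itertools.product varies its last factor fastest, so the axes are
    # assigned slowest-first by walking `order` in reverse.
    for values in product(range(1, size + 1), repeat=n):
        point = [0] * n
        for axis, value in zip(reversed(order), values):
            point[axis] = value
        points.append(tuple(point))
    return points

def representative_family(size, n):
    """One sequential execution per cyclic permutation of the n coordinates."""
    orders = [tuple((start + j) % n for j in range(n)) for start in range(n)]
    return [scan(size, order) for order in orders]

def labels(points):
    """Render points such as (2, 1) as the string '21'."""
    return ["".join(str(c) for c in point) for point in points]

family = representative_family(3, 2)
first = labels(family[0])    # coordinates scanned from e_1 to e_2
second = labels(family[1])   # the cyclic permutation (e_2, e_1)
```

For the 2-D grid of size 3, `first` reproduces the ordering quoted in the text, and `second` is the other member of the two-element family.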

SECTION 5
CONCLUSIONS


A  modeling  and  analysis  methodology  for  parallel  algorithms  and 
architectures  has  been  presented.  Modular  computing  networks  (MCNs)  were 
introduced  as  a  unifying  concept  that  can  be  used  to  describe  both 
algorithms  and  architectures.  The  representation  of  an  MCN  exhibits  all  the 
relevant  information  that  characterizes  a  parallel  processing  algorithm, 
from  precedence  relations  and  order  of  execution,  through  scheduling  and 
pipelinability consideration, to map compositions and global
characterization. It clearly displays the hierarchical structure of a
parallel system and the multiplicity of choices for hardware implementation.
Our methodology applies both to arbitrary (irregular) networks and to
iterative ones. Regularity of networks translates directly into regularity
of the model we use to describe them and, consequently, into regularity of
the associated architectures. Problems of non-executability (deadlocks,
safeness,  etc.)  are  insignificant  in  our  methodology  and  can  be  easily 
detected  and  resolved. 

Infinite  MCNs,  which  occur  in  most  signal  processing  applications,  have 
been  characterized.  It  has  been  shown  that  the  key  property  for 
executability of such networks is structural finiteness (in addition to
absence  of  cycles,  of  course).  Infinite  MCNs  are  most  frequently  iterative, 
in  which  case  they  are  guaranteed  to  be  structurally  finite  and  can  be 
represented  by  a  finite  single-layer  diagram. 

There  exist  three  distinct  modes  of  execution  for  iterative  networks: 
iterative,  scheduled  and  self-timed.  Iterative  execution  is  simple  but  slow 
and  storage-intensive.  Scheduled  execution  is  fast  but  sensitive  to 
schedule  perturbation  caused  by  clock  skewing.  Self-timed  execution  offers 
a  simple  and  robust  alternative  at  the  cost  of  introducing  handshake 
protocols  between  communicating  processors.  It  is  the  best  choice  for  large 
networks  containing  many  processors. 

APPENDIX A

PROOF OF THEOREM 2.2 FOR INFINITE MCNs


If the MCN has an execution then it must be acyclic, as was pointed out
at the beginning of Section 2.3. To prove the converse we shall construct
an execution for an arbitrary acyclic, structurally-finite MCN.

First, notice that, by Theorem 2.1, the inputs of the MCN can be
numbered. Let us, therefore, denote the inputs by $\{x_i \,;\ 0 \le i < \infty\}$. Next,
recursively define a sequence of sets of variables $\{M_i\}$ according to the
following rule:


$$ M_0 := \{x_0\} $$

$$ M_{i+1} := \{x_{i+1}\} \cup \{\text{immediate successors of the members of } M_i\} $$


Thus, each set contains one new input of the MCN and all the immediate
successors of the preceding set. The sets $M_i$ are clearly disjoint, and,
in view of the local-finiteness property, each $M_i$ set is finite.
Moreover, every variable of the MCN is included in some $M_i$ set, because
every variable is either a global input or a finite successor of some global
input. Thus, the cascade


is, in fact, a representation of the network as a cascade of finite
(disjoint) subnetworks. Each $M_i$ set is finite, hence has an execution
with a finite number of levels. If we replace each $M_i$ by its execution,
we obtain a refinement of the previous representation, viz.,


00 

APPENDIX B


ADMISSIBLE  ARCHITECTURES 


A  composition  of  processors  is  called  admissible  if  the  following  three 
conditions  are  satisfied: 

(i)  There are no dangling inputs or outputs.

(ii)  There are no directed cycles.

(iii)  The  architecture  is  structurally  finite. 

Each  of  the  processors  comprising  an  architecture  can  itself  be  a 
composition  of  more  elementary  processors.  The  hierarchical  nature  of  the 
admissibility property implies the following result.


Theorem B.1

An  admissible  composition  of  admissible  architectures  is  itself  an 
admissible  architecture. 


Proof:


The theorem states that the three properties making up admissibility
should be exhibited by the composite architecture, if they were exhibited by
each of the subnetworks.

(i)  The composite architecture has no dangling terminals, because
every terminal is connected to some subnetwork (by admissibility
of the composition) and no subnetwork has dangling terminals (by
admissibility of the subnetworks).

(ii)  The composite architecture has no cycles because neither the
subnetworks nor the composition has cycles.

(iii)  Structural finiteness is made up of the three following
properties: local finiteness, finite ancestries, and
countability of connected components. Local finiteness is
inherited by the composite architecture because composition does
not change the number of inputs/outputs of processors within
each subnetwork. To prove that the finite ancestry property is
also inherited by the composite architecture it will be
sufficient to consider a single variable z. Suppose that z
belongs to some subnetwork $\mathcal{N}_i$. By the admissibility of the
composition, $\mathcal{N}_i$ has a finite number of ancestor subnetworks.
The ancestry of z is obtained by tracing back the ancestry
relation through the finite collection of subnetworks $\alpha(\mathcal{N}_i)$.
And since each subnetwork is admissible, the portion of $\alpha(z)$
within each ancestor of $\mathcal{N}_i$ is also finite, hence $\alpha(z)$ itself
is finite. Finally, an admissible composition has a countable
number of subnetworks (see Theorem 2.1) and each subnetwork has,
by assumption, a countable number of connected components.
Hence, the total number of connected components in the composite
network is countable, too.

APPENDIX C
PROOF  OF  THEOREM  2.3 
MINIMAL  EXECUTIONS  OF  FINITE  MCNs 

Every execution determines a numbering E( ) of the variables of an
MCN, viz.,

$$ x \in S_i \iff E(x) = i $$

This integer-valued function satisfies the inequality (see Section 3.1)

$$ E(x) - 1 \ge \max_y \{\, E(y) \,;\ y \in \alpha(x) \,\} \tag{C.1} $$

Every finite directed acyclic graph has a unique numbering E( ) of its
arcs (or equivalently of its vertices) that satisfies the equality

$$ E(x) - 1 = \max_y \{\, E(y) \,;\ y \in \alpha(x) \,\} \tag{C.2} $$

This well-known result (see, e.g., [14]) implies that every finite
executable MCN has a unique execution that satisfies (C.2). We shall call
this unique execution minimal for reasons that will become clear in the
sequel.

Let E( ) be an arbitrary non-minimal execution. Then, there exists
some variable x for which the strict inequality

$$ E(x) - 1 > \max_y \{\, E(y) \,;\ y \in \alpha(x) \,\} $$

holds. This means that x is evaluated several steps after all its
ancestors became available. Consequently, the numbering of x can be
modified to $1 + \max_y \{\, E(y) \,;\ y \in \alpha(x) \,\}$ without violating the precedence
relation. We shall refer in the sequel to this modification as an
elementary shift.

Each execution is a series-parallel combination and consequently has a
well-defined input-output map. Elementary shifts replace expressions of the
form e*e*...*p by expressions of the form p*e*...*e (see Figure C-1).

If the physically justifiable identity

p*e = e*p    (C.3)

is added as an axiom of the theory of MCNs (see Appendix D), we conclude
that input-output maps remain invariant under elementary shifts. This leads
to the following result.


Theorem C.1


Every  execution  E(  )  of  a  finite  MCN  can  be  transformed  by  a  finite 
number  of  elementary  shifts  into  the  unique  minimal  execution. 


Proof: 

The minimal execution $\hat{E}(\,)$ is constructed by the following simple
algorithm (see, e.g., [14]):

(i) Put all the global inputs of the MCN in $\hat{S}_0$.

(ii) For i = 0,1,2,... put all the immediate successors of members
of $\hat{S}_i$ in $\hat{S}_{i+1}$ (a variable whose predecessors lie in several
levels is placed one level after the latest of them, in accordance
with (C.2)).

Now, if E( ) is a non-minimal execution we transform it into $\hat{E}(\,)$ by the
following rule:

For i = 0,1,2,... shift all members x
of $\hat{S}_i$ from E(x) to $\hat{E}(x) = i$.

Since the MCN is finite, a finite number of shifts will transform E( )
into $\hat{E}(\,)$. Notice that each variable is shifted at most once. Also notice
that, by its construction, the number $\hat{E}(x)$ is equal to the length of the
longest path connecting x to some global input. Hence, $\hat{E}(x)$ cannot be
further reduced.
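The construction in the proof can be sketched as follows, under the assumption that the finite MCN is described by a dictionary `preds` mapping each variable to its immediate predecessors. The names `minimal_execution` and `shift_to_minimal` and the example network are hypothetical; the standard-library `graphlib.TopologicalSorter` is used only to visit variables after their predecessors.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def minimal_execution(preds):
    """Number each variable 1 + max over its predecessors; inputs get 0."""
    E_min = {}
    for x in TopologicalSorter(preds).static_order():
        E_min[x] = 1 + max((E_min[y] for y in preds.get(x, ())), default=-1)
    return E_min


def shift_to_minimal(E, preds):
    """Apply elementary shifts level by level; return result and shift count."""
    E_min = minimal_execution(preds)
    current, shifts = dict(E), 0
    for x in sorted(E_min, key=E_min.get):          # levels i = 0, 1, 2, ...
        target = 1 + max((current[y] for y in preds.get(x, ())), default=-1)
        if current[x] != target:                    # already minimal: no shift
            current[x] = target
            shifts += 1
    return current, shifts


# Hypothetical network: u and v are global inputs.
preds = {"u": [], "v": [], "w": ["u", "v"], "z": ["w", "u"]}
E = {"u": 0, "v": 1, "w": 3, "z": 5}                # a valid, non-minimal execution

E_min = minimal_execution(preds)
result, n_shifts = shift_to_minimal(E, preds)
```

Visiting variables in order of increasing minimal level guarantees that the predecessors of x have already reached their final numbers when x is shifted, so one shift per variable is enough.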


Corollary C.1.1

The minimal execution $\hat{E}(\,)$ satisfies $\hat{E}(x) \le E(x)$ for every variable x
and for every execution E( ).


Corollary  C.1.2 


A  finite  executable  MCN  has  a  unique  well-defined  input-output  map. 
This is so because all executions define the same map, by Theorem C.1.


Proof  of  Theorem  2.3 


Corollary C.1.2 establishes the theorem for finite MCNs. For infinite
networks it will be sufficient to prove that for every execution E( ) and
for every variable x the map from global inputs to x is unique and does
not depend upon the choice of execution. However, E( ) induces some
execution on the finite MCN corresponding to a(x), the finite ancestry of
x. Therefore, the map from inputs to x coincides, for every choice of
E( ), with the unique map determined by the minimal execution on a(x).
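The reduction to a finite ancestry can be illustrated with a small sketch. It assumes an infinite network described lazily by a function `preds_of` that returns the immediate predecessors of a variable; the quarter-plane grid below is a hypothetical example chosen only because every variable in it has a finite ancestry.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def finite_ancestry(preds_of, x):
    """Collect all ancestors of x by walking predecessor links backwards."""
    seen, stack = set(), list(preds_of(x))
    while stack:
        y = stack.pop()
        if y not in seen:
            seen.add(y)
            stack.extend(preds_of(y))
    return seen


def minimal_level(preds_of, x):
    """Minimal-execution number of x, computed on its finite ancestry only."""
    sub = {y: list(preds_of(y)) for y in finite_ancestry(preds_of, x) | {x}}
    level = {}
    for y in TopologicalSorter(sub).static_order():
        level[y] = 1 + max((level[p] for p in sub[y]), default=-1)
    return level[x]


def grid_preds(var):
    """Hypothetical infinite network on pairs (i, j) of nonnegative integers."""
    i, j = var
    return [p for p in ((i - 1, j), (i, j - 1)) if p[0] >= 0 and p[1] >= 0]
```

For example, `minimal_level(grid_preds, (2, 3))` inspects only the eleven ancestors of the variable (2, 3), even though the network itself is unbounded.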
It is interesting to notice that an infinite MCN does not have, in
general, a minimal execution. The construction procedure described in the
proof of Theorem C.1 is still valid, but the sets $\hat{S}_i$ are, in general,
infinite and do not determine a valid execution.


APPENDIX  D 

ELEMENTARY  EQUIVALENCE  TRANSFORMATIONS 


The  general  theory  of  MCNs  does  not  involve  any  specific  assumptions 
about the properties of the processor maps $\{f_i\}$. Consequently, there are
only  a  few  equivalence  transformations  that  are  still  valid  in  this  general 
framework.  Most  equivalence  transformations  used  with  block-diagrams  and 
signal-flow-graphs  involve  linearity  assumptions  and  do  not  hold  for  general 
nonlinear  maps. 

Two basic maps, the identity map e and the split map s, can be used
in conjunction with any MCN manipulation. The identity map leaves its input
variables unchanged, viz.,

$$e(x) = x$$

The split map duplicates input variables, viz.,

$$s(x) = (x, x)$$


It  is  possible,  of  course,  to  have  more  than  two  copies  of  the  same 
variable,  either  by  introducing  a  split  processor  with  several  outputs,  or 
by  using  several  two-output  split  processors. 

The  properties  of  the  identity  and  split  processors  give  rise  to 
several elementary equivalence transformations (Figure D-1):

(a)  The  identity  commutes  with  any  other  processor  f. 

(b) The cascade of a processor f and its inverse $f^{-1}$ can be
replaced by an identity processor, provided the processor f has
an inverse.

(c) The split processor 'commutes' with any processor f.

(d) Any processor f with two outputs can be replaced by a
composition of a split processor and two single-output processors
$f_1$, $f_2$. The processors $f_1$, $f_2$ correspond to the maps from
inputs to each of the two outputs, respectively.
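The four transformations (a) to (d) can be checked on concrete maps. The following is a minimal sketch: the helper names `cascade` and `side_by_side`, the sample nonlinear processor `f` with inverse `f_inv`, and the two-output processor `g` are all assumptions introduced for illustration. Exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction


def e(x):
    """Identity map: leaves its input unchanged."""
    return x


def s(x):
    """Split map: duplicates its input."""
    return (x, x)


def cascade(first, second):
    """Feed the output of `first` into `second`."""
    return lambda x: second(first(x))


def side_by_side(f1, f2):
    """Apply f1 and f2 to the two components of a pair."""
    return lambda pair: (f1(pair[0]), f2(pair[1]))


# Hypothetical nonlinear processor with an inverse (x != -1, y != 0).
def f(x):
    return 1 / (1 + x)


def f_inv(y):
    return 1 / y - 1


# Hypothetical two-output processor and its single-output components.
def g(x):
    return (x * x, x + 1)


def g1(x):
    return x * x


def g2(x):
    return x + 1


x = Fraction(3, 4)
checks = {
    "a": cascade(e, f)(x) == cascade(f, e)(x),                   # identity commutes
    "b": cascade(f, f_inv)(x) == e(x),                           # f then inverse
    "c": cascade(f, s)(x) == cascade(s, side_by_side(f, f))(x),  # split 'commutes'
    "d": g(x) == cascade(s, side_by_side(g1, g2))(x),            # output splitting
}
```

Note that in (c) moving the split ahead of f duplicates the processor, which is why the report puts 'commutes' in quotation marks.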
Figure D-1. Elementary Equivalence Transformations: (a) Commutativity of the Identity; (b) The Inverse Processor; (c) Commutativity of the Split; (d) Splitting of Multivariable Outputs
APPENDIX E

ANALYSIS  OF  MATRIX  MULTIPLIERS 


The multiplication of two matrices involves the computation of inner
products between every row of one matrix and every column of the other one.
To emphasize this interpretation we shall consider in the sequel the product

$$C := A^{*} B$$

so that the inner products are between columns of A and columns of B. In
fact, $C_{ij}$ is precisely the inner product between the i-th column of A
and the j-th column of B. Consequently, we can compute the product by
feeding the columns of A, B, which we denote by $a_i$, $b_j$, into the MCN of
Figure E-1. Each input is a column vector; the a, b inputs of each processor
propagate through without modification, and the inner product of the two
input vectors is computed inside the processor. This multichannel
configuration can be further decomposed by observing that the inner products
can be computed recursively, i.e., if $c := a^{*} b$ where $a = \{\alpha_i\}$, $b = \{\beta_i\}$
are column vectors of length N, then $c = c_N$ where

$$c_i = c_{i-1} + \alpha_i \beta_i , \qquad c_0 = 0$$


Thus, every single processor in Figure E-1 is, in fact, a cascade of basic
'multiply and add' processors (Figure E-2). When this decomposition is
combined with the architecture of Figure E-1, we obtain the MCN for matrix
multiplication. Figure E-3 displays a side view of this 3-D network whose
top view is shown in Figure E-1. The complete MCN consists of N
horizontal layers such as in Figure E-1 arranged in a vertical stack.
Equivalently, we may say the MCN consists of three vertical layers such as
in Figure E-3 arranged behind each other. It is important to notice that
the direction of the C-paths can be either from top to bottom, as shown in
Figure E-3, or from bottom to top. This is a consequence of the
commutativity and associativity of addition, viz.,

$$\sum_{i=1}^{N} \alpha_i \beta_i = \sum_{i=N}^{1} \alpha_i \beta_i$$

This means that there are two distinct MCNs that correspond to matrix
multiplication and they differ only by the direction of the C-paths.

Figure E-1. A Basic Matrix Multiplier (panel: The Complete Network)
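The cascade of 'multiply and add' processors and the two possible C-path directions can be sketched as follows. The function names, the column-list representation of the matrices and the 2-by-2 example are assumptions made for illustration; each matrix is passed as a list of its columns.

```python
def multiply_add(c_prev, alpha, beta):
    """One basic 'multiply and add' processor: c_i = c_{i-1} + alpha_i * beta_i."""
    return c_prev + alpha * beta


def inner_product(a, b, bottom_to_top=False):
    """Cascade N multiply-add processors along the C-path, starting from 0."""
    order = reversed(range(len(a))) if bottom_to_top else range(len(a))
    c = 0
    for i in order:
        c = multiply_add(c, a[i], b[i])
    return c


def multiply_by_columns(A_cols, B_cols, bottom_to_top=False):
    """C[i][j] = inner product of the i-th column of A and the j-th column of B."""
    return [[inner_product(a, b, bottom_to_top) for b in B_cols] for a in A_cols]


A_cols = [[1, 2], [3, 4]]   # columns of A
B_cols = [[5, 6], [7, 8]]   # columns of B
C_down = multiply_by_columns(A_cols, B_cols)
C_up = multiply_by_columns(A_cols, B_cols, bottom_to_top=True)
```

With exact (integer) entries the two directions give identical results. With floating-point entries they may differ in the last digits, since floating-point addition is not associative.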

Every architecture for matrix multiplication is equivalent to the MCN
of Figure E-3. The various architectures are obtained by imposing
additional constraints upon the matrices (e.g., bandedness) and rearranging
the resulting reduced MCN as a space-time diagram. The corresponding
self-timed block-diagram follows immediately from this rearrangement.

The matrix multiplier of S.Y. Kung [7] is obtained by interpreting the
vertical dimension in Figure E-3 as 'time.' Since vertical arrows
correspond to local storage, the resulting block-diagram is described in
Figure E-4 (notice the similarity with E-1). The elements of each column
vector $a_i$, $b_j$ are fed sequentially into the array and each processor has a
self-loop which computes the inner product $c_{ij} = a_i^{*} b_j$ recursively in
time.
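Reading the vertical dimension as time turns each vertical chain of processors into a single processor with local storage. A rough simulation of that reading is sketched below. It is a simplification under stated assumptions: the delay of propagating a and b between neighbouring processors is ignored, and the function name and data layout (matrices as lists of columns) are hypothetical.

```python
def self_loop_array(A_cols, B_cols):
    """Simulate a 2-D array whose processor (i, j) accumulates c_ij over time."""
    n_steps = len(A_cols[0])
    acc = [[0 for _ in B_cols] for _ in A_cols]   # local storage (self-loops)
    history = []
    for t in range(n_steps):                      # time step t feeds element t
        for i, a in enumerate(A_cols):
            for j, b in enumerate(B_cols):
                acc[i][j] += a[t] * b[t]          # one multiply-add per step
        history.append([row[:] for row in acc])   # snapshot of all accumulators
    return acc, history


A_cols = [[1, 2], [3, 4]]   # columns of A, fed one element per time step
B_cols = [[5, 6], [7, 8]]   # columns of B, fed one element per time step
C, history = self_loop_array(A_cols, B_cols)
```

The list `history` records the accumulator contents after each step, which makes the recursion in time visible: the final snapshot equals the product.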


The matrix multiplier of S. Rao [10] is designed for a banded B
matrix. It will be sufficient to analyze it for a single column of A, say
$a_i$. The MCN of Figure E-3 now has only one vertical layer, and many
processors in this layer have zero inputs and can be eliminated. The
resulting reduced MCN is shown in Figure E-5a. Dummy processors, shown in
broken line, were added to emphasize the tridiagonal nature of the MCN. A
self-timed block-diagram (Figure E-5b) is obtained by considering the
diagonal axis as 'time.' It consists of a linear array of identical
processors, one for each nonzero diagonal of the banded matrix B. The
elements of B are fed into the array by diagonals. The elements of A, C
are handled by columns: every column of A produces a row of C and
requires a linear array as in Figure E-5b. It is interesting to notice that
the input interval of this matrix multiplier is $\tau_c + \tau_a$, where $\tau_c$ is the
time required to compute 'c' and $\tau_a$ is the time required to propagate
'a' through one processor. When the direction of the C-path or,
equivalently, of the A-path, is reversed the input interval becomes
$\tau_c - \tau_a$. Since $\tau_a \ll \tau_c$ the two networks differ only slightly in their
throughput. However, we shall presently encounter another example where the
reversal of the C-path results in a large increase in throughput.

Figure E-4. The Matrix Multiplier of S.Y. Kung (panel a: Self-Timed Block-Diagram)

The matrix multiplier of H.T. Kung is designed for banded A, B
matrices. This means that the active processors in the non-reduced MCN of
Figures E-1 and E-3 are located within a parallelepiped aligned with one of
the main diagonals of the rectangular prism representing the non-reduced
MCN. A simple illustration of the reduced MCN is obtained by considering
two adjacent horizontal layers (Figure E-6). When we slide the horizontal
layers so that they overlap, the resulting network corresponds to H.T.
Kung's multiplier (Figure E-7). This network clearly has an input interval
of $\tau_c + 2\tau_a$. However, if we reverse the C-path we obtain the configuration
of Weiser and Davis [4] (Figure E-8), which has an input interval of
$|\tau_c - 2\tau_a|$. The difference between the two multipliers is significant when
they are implemented by single-rate systolic arrays. In this case
$\tau = \tau_c = \tau_a$, so that the former network has an input interval of $3\tau$ while
the latter has an input interval of $\tau$!
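The input-interval expressions quoted in this appendix can be tabulated with a small helper. This is only an arithmetic sketch: the function name and the parameter `hops` (the multiple of the 'a' propagation time that appears in the expression, 1 for Rao's array and 2 for the H.T. Kung and Weiser-Davis arrays) are assumptions, and taking the absolute value in the one-hop case is also an assumption, harmless when the propagation time is much smaller than the compute time.

```python
def input_interval(tau_c, tau_a, hops, reversed_c_path=False):
    """Input interval for compute time tau_c and propagation time tau_a."""
    if reversed_c_path:
        return abs(tau_c - hops * tau_a)   # C-path reversed
    return tau_c + hops * tau_a            # C-path as originally drawn


# Single-rate systolic array: tau_c == tau_a == tau (tau = 1 time unit here).
tau = 1
kung = input_interval(tau, tau, hops=2)                                # H.T. Kung
weiser_davis = input_interval(tau, tau, hops=2, reversed_c_path=True)  # reversed

# Rao-style array with propagation much faster than computation.
rao = input_interval(10, 1, hops=1)
rao_reversed = input_interval(10, 1, hops=1, reversed_c_path=True)
```

In the single-rate case the two intervals differ by a factor of three, whereas in the Rao-style case with these sample values the reversal changes the interval only from 11 to 9 time units.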
[Figure panels: layer i, layer i+1]

Figure E-8. The Matrix Multiplier of Weiser and Davis: (a) Side View; (b) Top View